OpenTSDB at Box
#HBaseCon2013
Jonathan Creasy
Geoffrey Anderson @geodbz
Jonathan Creasy
• SysAdmin @ Box, Inc.
• Hadoop for Analytics
Geoffrey Anderson
• DBA @ Box, Inc.
• Tooling for MySQL and HBase
• #DBHangOps
The
Situation
• Storing
• RRDs
• Flat files
• Pre-defined
• Graphs
• Data to collect
• Poll model
These are problematic because...
Enter OpenTSDB
OpenTSDB is...
• Distributed
• Scalable
• Time Series Database
• Runs on HBase
• Created By
Benoit Sigoure
[Diagram labels: HBase, TSD for Q...]
• FAST
• EASY to Scale
• EASY to Populate
• EASY to collect data
• EASY to Query
Why OpenTSDB?
Collecting
Data
#!/usr/bin/env bash
timestamp=$(date +%s)
mysql -ss -e "SHOW GLOBAL STATUS" | while read var val
do
echo "mysql.$var $time...
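Each line the collector prints carries a metric name, a Unix timestamp, a value, and one or more key=val tags. As a rough illustration of that line format, here is a minimal Python sketch. The helper name `format_datapoint` and the explicit timestamp argument are assumptions made for the example; they are not part of the original script.

```python
import time


def format_datapoint(metric, value, tags, timestamp=None):
    """Build one 'metric timestamp value key=val ...' line (hypothetical helper)."""
    if not tags:
        # OpenTSDB expects at least one tag on every data point.
        raise ValueError("at least one tag is required")
    if timestamp is None:
        timestamp = int(time.time())
    tag_str = " ".join(f"{key}={val}" for key, val in sorted(tags.items()))
    return f"{metric} {timestamp} {value} {tag_str}"


print(format_datapoint("mysql.Bytes_sent", 1238166682,
                       {"host": "mydb.example.com"}, timestamp=1366399993))
```

With the sample values taken from the slide output, this prints the same `mysql.Bytes_sent` line the bash collector produces.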
* * * * * mysql_collector.sh | nc opentsdb.example.com 4242
Example: adding a cron for OpenTSDB
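The cron entry pipes collector output straight to port 4242. OpenTSDB's line-based (telnet-style) interface expects each data point to start with the `put` command, so raw collector lines may need that prefix before they are sent; tcollector is understood to add it on the collector's behalf. The sketch below shows one way to do this in Python. The helper names `to_put_lines` and `send_lines` are hypothetical, and the host and port are simply the ones from the cron example.

```python
import socket


def to_put_lines(datapoints):
    """Prefix each non-empty data point line with 'put ' and end it with a newline."""
    return "".join(f"put {line.strip()}\n" for line in datapoints if line.strip())


def send_lines(datapoints, host="opentsdb.example.com", port=4242, timeout=5.0):
    """Send data points to a TSD over TCP (hypothetical helper, no retries)."""
    payload = to_put_lines(datapoints).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(payload)
```

A collector could call `send_lines(lines)` once per run instead of relying on `nc`; error handling and reconnects are left out to keep the sketch short.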
ganderson@mydb.example.com:tcollector$ tree
.
|-- collectors
| |-- 0
| | |-- ifstat.py
| | |-- iostat.py
| | |-- procnettc...
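In the tcollector layout, the directory a collector lives in determines how often it runs: `0` holds long-running collectors that are started once, and the other numeric directories (15, 30, 300) give the interval in seconds. A small Python sketch of that naming convention follows; `collector_interval` is a hypothetical helper written for illustration, not a function from tcollector.

```python
import os


def collector_interval(path):
    """Return the interval in seconds implied by a collector's parent directory.

    0 means the collector is started once and runs forever; None means the
    directory name is not numeric (for example 'etc').
    """
    parent = os.path.basename(os.path.dirname(path))
    return int(parent) if parent.isdecimal() else None


for script in ("collectors/0/ifstat.py",
               "collectors/30/mysql_collector.sh",
               "collectors/etc/config.py"):
    print(script, collector_interval(script))
```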
Querying
Data
http://opentsdb.example.com
/#start=2013/06/05-17:00:00
&end=2013/06/05-19:00:00
&m=sum:hadoop.hbase.regionserver.requests...
http://opentsdb.example.com
/q?start=2013/06/05-17:00:00
&end=2013/06/05-19:00:00
&m=sum:hadoop.hbase.regionserver.request...
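The query URLs share one shape: a start and end time, one `m=aggregator:metric{tags}` parameter per series, and `&ascii` on the `/q` endpoint to get text instead of a graph. The Python sketch below assembles such a URL from its parts. `build_query_url` is a hypothetical helper, and the graph options from the slides (`o`, `ylabel`, `yrange`, `wxh`) are left out for brevity.

```python
from urllib.parse import urlencode


def build_query_url(base, start, end, metrics, ascii_output=True):
    """Assemble a /q URL from (aggregator, metric, tags) tuples (hypothetical helper)."""
    params = [("start", start), ("end", end)]
    for aggregator, metric, tags in metrics:
        tag_str = ",".join(f"{key}={val}" for key, val in sorted(tags.items()))
        expr = f"{aggregator}:{metric}"
        if tag_str:
            expr += "{" + tag_str + "}"
        params.append(("m", expr))
    # Keep the characters used in the slide URLs readable instead of percent-encoding them.
    url = f"{base}/q?" + urlencode(params, safe=":/{}=,")
    return url + "&ascii" if ascii_output else url


print(build_query_url(
    "http://opentsdb.example.com",
    "2013/06/05-17:00:00",
    "2013/06/05-19:00:00",
    [("sum", "hadoop.hbase.regionserver.requests", {"server_type": "dwh-data"})],
))
```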
How does this change things?
In all seriousness, though...
• Easily see aggregate graphs
• Easily build graphs on-the-fly
• Full granularity forever
• ...
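Because raw data stays available at full granularity, comparisons such as week-over-week ratios can be computed directly from API output. The Python sketch below is one possible approach, under two assumptions that should be checked against a real TSD: that the text output uses the same `metric timestamp value tags` layout as the collector lines, and that each response holds a single series. The names `parse_ascii` and `week_over_week` are hypothetical.

```python
WEEK_SECONDS = 7 * 24 * 60 * 60


def parse_ascii(text):
    """Parse 'metric timestamp value tags...' lines into {timestamp: value}."""
    points = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            points[int(fields[1])] = float(fields[2])
    return points


def week_over_week(current, previous):
    """Return {timestamp: ratio} of each value to the value seven days earlier."""
    ratios = {}
    for ts, value in current.items():
        old = previous.get(ts - WEEK_SECONDS)
        if old:  # skip missing or zero baselines
            ratios[ts] = value / old
    return ratios
```

A ratio far from 1.0 at a given timestamp is a cheap first signal that this week differs from last week.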
Challenges Switching
• Aggregates are the default
• Mouse-zooming (patched!)
• Auto-suggest for metrics
• “The graphs aren...
Some
Quick
Numbers
OpenTSDB @ Box
• 24,448 metrics
• 79 tag keys
• 5,371,701 tag values
• 150,000 data points per second
To store metric data for
anything
that is
measurable
Collection Philosophy
Next Steps
Enjoy #Hbasecon2013!
https://www.box.com/about-us/careers/
jcreasy@box.com
geoff@box.com
We’re Hiring!
Image credits
• http://upload.wikimedia.org/wikipedia/commons/7/7b/Batelco_Network_Operations_Centre_(NOC).JPG
• http://ww...
HBaseCon 2013: OpenTSDB at Box

Presented by: Jonathan Creasy (Box) and Geoffrey Anderson (Box)

Published in: Technology, Business
  • Will be talking about OpenTSDB. How OpenTSDB changed monitoring at Box.
  • Running Ganglia: get pushed metrics; have to define aggregates; RRD format.
  • Cacti has an easy centralized interface; lots of templates accessible; uses polling model.
  • Graphite! Get pushed metrics from various services; need to define the graphs you want; need to define aggregations.
  • RRDs auto-downsample. Flat files can be hard to manage at scale.
    Pre-defined: need to know what you want to monitor; painful to set up new collections/graphs.
    Poll: doesn’t scale horizontally well; falls behind and data gets dropped; occasionally drop important metrics to make it catch up.
    Images:
    http://monami.sourceforge.net/gfx/ganglia.png
    http://www.cacti.net/images/cacti_promo_main.png
    http://graphite.wdfiles.com/local--files/screen-shots/graphite_cli_800.png
    http://nagios.sourceforge.net/images/screens/new/service-detail.png
  • Suddenly finding problems and correlating issues is difficult. Maybe you don’t have a NOC yet. Maybe you do, and they need better graphs.
  • IT’S BIGGER ON THE INSIDE (just kidding). Fast! Easy to build graphs on the fly. Hella easy to scale: just add nodes (HBase or TSDs). Very easy to put data into it: NEXT SLIDES TALK ABOUT THIS YO.
  • Running threads follows the CPU spikes PERFECTLY. Box has a “long query” killer that gets more aggressive as more threads stack up. Should get a look at queries on the server.
  • If you prefer text, that’s also an option via API. You can build cool tools using the API: week-over-week graphs; simplifies anomaly detection. URL is pretty simple: effectively just use “q?” and add “&ascii”.
  • Aggregates are the default: a shift in thinking from looking at specific important servers.
    Zooming in on a time slice was painfully manual: I wrote up a patch to add mouse-zooming and upstreamed it. This cemented OpenTSDB as a powerful monitoring tool for Box, overnight.
    Auto-suggest for metrics is spotty: we wrote a quick cron job that dumps the full metric list into JSON.
    “Graphs aren’t pretty”: a few changes to the base GNUPlot options solved this. There’s also a “Smooth” option in the interface now.
    Migrating from POC: we had a single-node setup for the longest time until that fell over... a lot.
    Plan for 3+ machines: it’s enough to run all the needed bits for a light-weight distributed HBase and TSD setup.
    Data pruning: ~4 bytes per metric before HDFS replication add up quickly. mysql_tcollector: 370 metrics, ~1.5k per server; at a 30s interval = ~4.2MB/day. Either have a plan to prune old data, or build out extra capacity and predict storage needs per server/metric added.
  • New metrics available with every code push
  • Transcript of "HBaseCon 2013: OpenTSDB at Box"

    1. OpenTSDB at Box #HBaseCon2013 Jonathan Creasy Geoffrey Anderson @geodbz
    2. Jonathan Creasy • SysAdmin @ Box, Inc. • Hadoop for Analytics
    3. Geoffrey Anderson • DBA @ Box, Inc. • Tooling for MySQL and HBase • #DBHangOps
    4. The Situation
    5. • Storing • RRDs • Flat files • Pre-defined • Graphs • Data to collect • Poll model These are problematic because...
    6. Enter OpenTSDB
    7. OpenTSDB is... • Distributed • Scalable • Time Series Database • Runs on HBase • Created By Benoit Sigoure
        [Diagram labels: HBase, TSD for Querying, mydb.example.com, HAProxy, fe1.example.com, TSD for Storing, Push Metrics, Query via API]
    8. Why OpenTSDB? • FAST • EASY to Scale • EASY to Populate • EASY to collect data • EASY to Query
    9. Collecting Data
    10. Example: mysql_collector.sh
        #!/usr/bin/env bash
        timestamp=$(date +%s)
        mysql -ss -e "SHOW GLOBAL STATUS" | while read var val
        do
            echo "mysql.$var $timestamp $val host=$HOSTNAME"
        done

        ganderson@mydb.example.com:~$ ./mysql_collector.sh
        mysql.Aborted_connects 1366399993 0 host=mydb.example.com
        mysql.Binlog_cache_disk_use 1366399993 0 host=mydb.example.com
        mysql.Binlog_cache_use 1366399993 0 host=mydb.example.com
        mysql.Binlog_stmt_cache_disk_use 1366399993 0 host=mydb.example.com
        mysql.Binlog_stmt_cache_use 1366399993 0 host=mydb.example.com
        mysql.Bytes_received 1366399993 19453687 host=mydb.example.com
        mysql.Bytes_sent 1366399993 1238166682 host=mydb.example.com
        mysql.Com_admin_commands 1366399993 1 host=mydb.example.com
        mysql.Com_assign_to_keycache 1366399993 0 host=mydb.example.com
        ...
    11. Example: mysql_collector.sh (script and output repeated from slide 10, with the fields of each output line labeled)
        mysql.Aborted_connects  1366399993  0      host=mydb.example.com
        Metric name             Timestamp   Value  “Tags” (key=val)
    12. Example: adding a cron for OpenTSDB
        * * * * * mysql_collector.sh | nc opentsdb.example.com 4242
    13. ganderson@mydb.example.com:tcollector$ tree
        .
        |-- collectors
        |   |-- 0                        <- Run forever
        |   |   |-- ifstat.py
        |   |   |-- iostat.py
        |   |   |-- procnettcp.py
        |   |   |-- procstats.py
        |   |-- 15                       <- Run every 15 seconds
        |   |   `-- dfstat.py
        |   |-- 30                       <- Run every 30 seconds
        |   |   |-- mysql_collector.sh
        |   |-- 300                      <- Run every 5 minutes
        |   |   `-- ptTcpModel.sh
        |   `-- etc
        |       |-- config.py
        |-- config
        |-- startstop
        `-- tcollector.py
    14. Querying Data
    15. http://opentsdb.example.com
        /#start=2013/06/05-17:00:00
        &end=2013/06/05-19:00:00
        &m=sum:hadoop.hbase.regionserver.requests{server_type=dwh-data}
        &o=axis x1y1
        &m=sum:proc.stat.cpu.percentage_iowait{server_type=dwh-data,dc=lv7,host=data08}
        &o=axis x1y2
        &ylabel=HBase Requests
        &y2label=&CPU IOWait
        &yrange=[0:]
        &wxh=1475x600
    16. http://opentsdb.example.com
        /q?start=2013/06/05-17:00:00
        &end=2013/06/05-19:00:00
        &m=sum:hadoop.hbase.regionserver.requests{server_type=dwh-data}
        &o=axis x1y1
        &m=sum:proc.stat.cpu.percentage_iowait{server_type=dwh-data,dc=lv7,host=data08}
        &o=axis x1y2
        &ylabel=HBase Requests
        &y2label=&CPU IOWait
        &yrange=[0:]
        &wxh=1475x600
        &ascii
    17. How does this change things?
    18. In all seriousness, though... • Easily see aggregate graphs • Easily build graphs on-the-fly • Full granularity forever • API request for raw data • Cluster-wide nagios checks with check_tsd
    19. Challenges Switching • Aggregates are the default • Mouse-zooming (patched!) • Auto-suggest for metrics • “The graphs aren’t pretty” • Migrating from proof of concept • Plan for 5+ machines • Data pruning may be required
    20. Some Quick Numbers: OpenTSDB @ Box • 24,448 metrics • 79 tag keys • 5,371,701 tag values • 150,000 data points per second
    21. Collection Philosophy: To store metric data for anything that is measurable
    22. Next Steps
    23. Enjoy #Hbasecon2013! We’re Hiring! https://www.box.com/about-us/careers/ jcreasy@box.com geoff@box.com
    24. Image credits
        • http://upload.wikimedia.org/wikipedia/commons/7/7b/Batelco_Network_Operations_Centre_(NOC).JPG
        • http://www.flickr.com/photos/hoyvinmayvin/5873697252/
        • http://www.percona.com/doc/percona-monitoring-plugins
        • http://www.2cto.com/uploadfile/2012/0731/20120731112415744.jpg
        • http://media.tumblr.com/tumblr_lvfspoenWU1qi19a2.png
        • http://img.izismile.com/img/img4/20110527/640/you_can_be_a_superhero_640_01.jpg
        • http://openclipart.org/image/250px/svg_to_png/26427/Anonymous_notebook.png
        • http://images.alphacoders.com/768/2560-1600-76893.jpg
        • http://www.flickr.com/photos/in365/4861180503/
        • http://openclipart.org/image/250px/svg_to_png/130915/Prohibido_3D.png
        • http://www.flickr.com/photos/61114149@N02/5566484951/
        • http://opentsdb.net/img/tsd-sample.png
        • http://images2.wikia.nocookie.net/__cb20080911160202/bttf/images/5/57/WhatdidItellyou-HQ.jpg
        • http://www.flickr.com/photos/lisakayaks/3028350539/
        • http://www.flickr.com/photos/25566302@N00/1472400115
        • http://www.flickr.com/photos/grandmaitre/5846058698/
        • http://www.flickr.com/photos/7518432@N06/2673347604/
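The data-pruning estimate in the speaker notes (370 metrics at roughly 4 bytes each, collected every 30 seconds) can be reproduced with a few lines of arithmetic. The Python sketch below does that. The function name `daily_bytes` is hypothetical, the 4-byte figure is the approximate per-point size quoted in the notes, and the result is before HDFS replication.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_bytes(metrics, interval_seconds, bytes_per_point=4):
    """Estimate bytes written per day for one server, before HDFS replication."""
    collections_per_day = SECONDS_PER_DAY // interval_seconds
    return metrics * bytes_per_point * collections_per_day


per_day = daily_bytes(metrics=370, interval_seconds=30)
print(f"{per_day} bytes/day (~{per_day / 1e6:.2f} MB)")
```

For the mysql collector this gives 370 x 4 = 1,480 bytes per collection and 2,880 collections per day, or 4,262,400 bytes, in line with the roughly 4.2MB/day quoted in the notes. Multiplying by server count and replication factor gives a first-order capacity forecast.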
