OpenTSDB is...
• Distributed
• Scalable
• Time Series Database
• Runs on HBase
• Created By
Benoit Sigoure
HBase
TSD for
Querying
mydb.example.com
HAProxy
fe1.example.com
TSD for
Storing
Push
Metrics
Query via API
• FAST
• EASY to Scale
• EASY to Populate
• EASY to collect data
• EASY to Query
Why OpenTSDB?
In all seriousness, though...
• Easily see aggregate graphs
• Easily build graphs on-the-fly
• Full granularity forever
• API request for raw data
• Cluster-wide nagios checks with check_tsd
Challenges Switching
• Aggregates are the default
• Mouse-zooming (patched!)
• Auto-suggest for metrics
• “The graphs aren’t pretty”
• Migrating from proof of concept
• Plan for 5+ machines
• Data pruning may be required
Will be talking about OpenTSDBHow OpenTSDB changed monitoring at box
Running gangliaGet pushed metricsHave to define aggregatesRRD format
Cacti has an easy centralized interfaceLots of templates accessibleUses polling model
Graphite!Get pushedmetricsfromvarious servicesNeed to define the graphs youwantNeed to define aggregations
RRDs auto-downsampleFlat files can be hard to manage at scalePre-definedNeed to know what you want to monitorPainful to setup new collections/graphsPollDoesn’t scale horizontally wellFalls behind and data gets droppedOccasionally drop important metrics to make it catch uphttp://monami.sourceforge.net/gfx/ganglia.pnghttp://www.cacti.net/images/cacti_promo_main.pnghttp://graphite.wdfiles.com/local--files/screen-shots/graphite_cli_800.pnghttp://nagios.sourceforge.net/images/screens/new/service-detail.png
Suddenly finding problems and correlating issues is difficultMaybe you don’t have a NOC yetMaybe you do, and they need better graphs
IT’S BIGGER ON THE INSIDE – just kiddingFast!Easy to build graphs on the flyHella easy to scale – just add nodes (HBase or TSDs)Very easy to put data into it – NEXT SLIDES TALK ABOUT THIS YO
Running threads follows the CPU spikes PERFECTLYBox has a “long query” killer that gets more aggressive as more threads stack upShould get a look at queries on the server
Running threads follows the CPU spikes PERFECTLYBox has a “long query” killer that gets more aggressive as more threads stack upShould get a look at queries on the server
Running threads follows the CPU spikes PERFECTLYBox has a “long query” killer that gets more aggressive as more threads stack upShould get a look at queries on the server
If you prefer text, that’s also an option via APIYou can build cool tools using the APIWeek over Week graphsSimplifies anomaly detectionURL is pretty simpleEffectively just use “q?” and add “&ascii”
Aggregatesare thedefault–shift in thinking from lookingatspecificimportantservers.Zooming in on a timeslice was painfullymanual– I wroteup a patch to addmouse-zooming and upstreamed. Thiscementedopentsdb as a powerful monitoring tool for Box, overnightAuto-suggest for metricsisspotty– we wrote a quick cron job that dumps full metric list into JSON “Graphs aren’t pretty” – a few changes to the base GNUPlot options solved this. There’s also a “Smooth” option in the interface nowMigrating from POC – we had a single-node setup for the longest time until that fell over...a lotPlan for 3+ machines – it’s enough to run all the needed bits for a light-weight distributed HBase and TSD setupData pruning – ~4 bytes per metric before HDFS replication add up quicklymysql_tcollector - 370 metrics -- ~1.5k per server. X 30s interval = ~4.2MB/dayeither have a plan to prune old data or build out extra capacity and predict storage needs per server/metric added