
Analytics driven operations - Steve Acreman - Dataloop

(From the LondonCD meetup on 20 Oct 2016 - http://www.meetup.com/London-Continuous-Delivery/events/231766686/)

Modern infrastructure is becoming increasingly complex and harder to operate. Trends like containerisation, microservices and serverless architectures make it difficult to work out exactly what is happening when problems occur. Most companies are building large distributed systems that were unthinkable only a few years ago. This talk explains how an analytics monitoring stack puts developers and operations back in the driving seat and gives them back control over their uptime.

Published in: Software


  1. www.dataloop.io | @dataloopio | info@dataloop.io | Monitoring for Online Services
  2. What is Dataloop? › Performance › Up / Down Alerts › Dev Env › Enterprise Stuff
  3. Architecture
  4. First Year
  5. First Year
  6. Measure
  7. Putting out the fire › rollup worker › metric worker
  8. Problems › Node.js metric workers not scaling › Memory management was an issue › Needed big caches to reduce database load › GC cycles too long › 8 single processes on an 8-core server
  9. Metric worker rewrite › Approximately 6 weeks from no Erlang experience to a working version › No more crashes › Reduced servers needed from 16 to 8 › Pushes metrics straight from Rabbit into DalmatinerDB (new database)
  10. Today
  11. Happy Ending
  12. Just the beginning!
  13. Initial Instrumentation › StatsD libraries in Node and Erlang code › Push UDP packets to a StatsD server for aggregation
  14. Pitfalls › Metrics increase as service usage increases › UDP isn’t great › Aggregates across a service (hard to spot an outlier) › Quite lossy
  15. Better Instrumentation › Prometheus HTTP metrics endpoints › 10-second scrape interval into Dataloop › Raw data (no loss) › Dimensions allow drill-down to host level
  16. Prometheus Output › curl http://localhost/metrics
  17. What to instrument? › Everything! › Feature usage › Throughput › Error rates › If it moves, instrument it
  18. Analytics › Simple things like API response times
  19. Analytics › Pretty useful to plot when a problem started
  20. Yesterday vs. Today
  21. SQL-like Query Language
  22. Time Series Functions › Create a query to answer questions
  23. Future › Prediction algorithms › Search for ‘similar’ metrics › Outlier algorithms › More functions!
  24. Summary › Code-level metrics with Prometheus are extremely lightweight › Have a framework in place to quickly add more when issues arise › Don’t wait until your first fire to start › Start small and try to get both operations and developers on board
  25. Q&A
  26. www.dataloop.io | @dataloopio
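
The contrast between slides 13 to 16 (StatsD over UDP versus Prometheus-style metrics with dimensions) can be sketched roughly as follows. This is an illustrative sketch, not Dataloop's code: the metric names, the `host` label value and the helper function names are all assumptions made up for the example. It uses only the Python standard library and shows the two wire formats: a StatsD counter line sent as a fire-and-forget UDP datagram, and a Prometheus text-format sample line carrying a per-host label.

```python
import socket


def statsd_counter(name, value=1):
    """Format a StatsD counter increment as '<name>:<value>|c'."""
    return f"{name}:{value}|c".encode("ascii")


def send_statsd(packet, host="127.0.0.1", port=8125):
    """Send one datagram to a StatsD server (8125 is the usual default port).

    UDP is fire-and-forget: nothing confirms delivery, which is one
    source of the lossiness mentioned on the Pitfalls slide.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(packet, (host, port))


def prometheus_sample(name, labels, value):
    """Render one sample line in the Prometheus text exposition format.

    Labels (dimensions) stay attached to the sample, so a query can
    later drill down to a single host instead of a service-wide aggregate.
    """
    def escape(text):
        # Label values must escape backslash, double quote and newline.
        return (
            text.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
        )

    label_text = ",".join(
        f'{key}="{escape(val)}"' for key, val in sorted(labels.items())
    )
    return f"{name}{{{label_text}}} {value}"


# Hypothetical metric names and label values, for illustration only.
packet = statsd_counter("api.requests")
line = prometheus_sample("api_requests_total", {"host": "web-1"}, 42)
```

Here `packet` is the byte string that `send_statsd` would push to the aggregator, while `line` is the kind of text a scraper would read from a `/metrics` endpoint. The StatsD line carries no host dimension unless it is baked into the metric name, which is why outliers are hard to spot after aggregation.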
