
Monitoring patterns for mitigating technical risk

How to define auto-healing and alerts for a low-latency REST API system. Includes real-world Riemann code snippets.



  1. Monitoring Patterns for Mitigating Technical Risk @Forter
  2. #1 risk: slow or bad (500) API responses
     Auto-healing, because humans are slow: SLA, Failover, Degradation, Throttling
     Alerting: Detect, Filter, Alert, Diagnostics
  3. SLA
                          Performance      Data Loss    Business Logic
     TX Processing        Low Latency      Nope         Best Effort
     Stream Processing    High Throughput  Best Effort  Best Effort
     Batch Processing     High Volume      Nope         Reconciliation
  4. Automatic Failover
     - http fencing (Incapsula)
     - http load balancing (ELB)
     - instance restart (Scaling Group)
     - process restart (upstart)
     - exceptions bubble up and crash (see the sketch below)
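      The last two bullets describe a crash-only design: instead of trying to recover in-process, exceptions bubble up, crash the process, and the init system restarts it. A minimal upstart job sketch (service name and paths are hypothetical, not Forter's actual config):

          # /etc/init/api-worker.conf (hypothetical)
          description "api worker"
          respawn                 # restart the process whenever it dies
          respawn limit 10 5      # stop respawning after 10 crashes within 5 seconds
          exec node /opt/api/server.js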
  5. Graceful Degradation
     - nginx (lua)
     - expressjs (nodejs)
     - storm (java)
     Stability vs. Code Changes
  6. Throttling (without back-pressure)
     - request priority reduced when TX/sec > threshold
     - different priority → different queue → different worker
     - lower priority inside the queue for test probes
     (see the sketch below)
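      An illustrative sketch of that routing (in-process queues and a precomputed TX/sec rate are assumptions here, not Forter's actual code):

          ; one queue per priority level, each drained by its own workers,
          ; so low-priority traffic (e.g. test probes) cannot starve real requests
          (import '[java.util.concurrent LinkedBlockingQueue])

          (def queues {:high (LinkedBlockingQueue.)
                       :low  (LinkedBlockingQueue.)})

          (defn priority [request tx-rate threshold]
            (cond (:test-probe? request) :low    ; probes always yield
                  (> tx-rate threshold)  :low    ; shed priority under load
                  :else                  :high))

          (defn enqueue! [request tx-rate threshold]
            (.put ^LinkedBlockingQueue (queues (priority request tx-rate threshold))
                  request))

          (defn start-worker! [level handle]
            (doto (Thread. #(loop []
                              (handle (.take ^LinkedBlockingQueue (queues level)))
                              (recur)))
              (.setDaemon true)
              (.start)))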
  7. Alerting: Detect → Filter → Alert → Manual Diagnostics
  8. Detection
  9. Filter & Route
  10. Alert
  11. Diagnostics
  12. Redundancy
      - CloudWatch/CollectD: fast, but no root cause
      - App events (exceptions): too noisy, but give the root cause
      - Pingdom probes: low coverage, but reliable
      - Internal probes: better coverage, but false alarms
  13. CloudWatch → PagerDuty alert (no Riemann)
  14. System test → PagerDuty alert (Riemann needed)
  15. Filter tests using a state machine
  16. Filter tests using a state machine:

          (tagged "apisRegression"
            (pagerduty-test-dispatch "1234567892ed295d91"))

          (defn pagerduty-test-dispatch [key]
            (let [pd (pagerduty key)]
              (changed-state {:init "passed"}
                (where (state "passed") (:resolve pd))
                (where (state "failed") (:trigger pd)))))
  17. Re-open a manually resolved alert
  18. Re-open a manually resolved alert:

          (tagged "apisRegression"
            (pagerduty-test-dispatch "1234567892ed295d91"))

          (defn pagerduty-test-dispatch [key]
            (let [pd (pagerduty key)]
              (sdo
                (changed-state {:init "passed"}
                  (where (state "passed") (:resolve pd)))
                (where (state "failed")
                  (by [:host :service]
                    (throttle 1 60 (:trigger pd)))))))
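      The difference from the previous slide: the "failed" branch now sits outside changed-state, so every failing event re-triggers PagerDuty, throttled to at most one trigger per host/service per 60 seconds. An alert that a human resolved while the test is still failing is therefore re-opened within a minute.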
  19. Diagnostics - Storm topology timing
  20. Diagnostics - Storm timelines
  21. #2 risk: slowing down the merchant's website
      Alerting:
      - monitor each and every browser
      - aggregations (per browser type)
      - notify on thresholds
  22. Monitoring our JavaScript snippet
      - timeouts
      - exceptions by browser
      - exception aggregation
      - monitoring new versions
  23. Riemann's index (server monitoring)

      key (host+service)    event                    TTL
      10.0.0.1-redisfree    {"metric": "5"}          60
      10.0.0.1-probe1       {"state": "failed"}      300
      10.0.0.2-probe1       {"state": "passed"}      300
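      A minimal riemann.config sketch of how events reach the index and expire (standard Riemann API; the 10-second scan interval and 60-second default TTL are assumptions):

          ; scan the index every 10s and emit {:state "expired"} events
          ; for entries whose TTL has elapsed
          (periodically-expire 10)

          (let [index (index)]
            (streams
              (default :ttl 60    ; fallback TTL for events that carry none
                index)))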
  24. Riemann's index

      key (host+service)    event                        TTL
      199.25.1.1-1234       {"state": "loaded"}          300
      199.25.2.1-4567       {"state": "downloaded"}      300
      199.25.3.1-8901       {"state": "loaded"}          300

      For our use case: host = browser IP, service = cookie
  25. Riemann's state machine
      - (index) stores the last event per key and creates "expired" events when the TTL elapses
      - (by [:host :service] stream) creates a new stream for each host/service pair
      - (by-host-service stream) (Forter's fork only) also closes the stream when the TTL expires
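      A small illustration of the vanilla behaviour (hypothetical, just to show the fan-out):

          ; `by` forks the child stream once per distinct [host service] pair,
          ; so the stateful changed-state below tracks each key separately;
          ; unlike by-host-service, these forks are never torn down on expiry
          (by [:host :service]
            (changed-state {:init "passed"}
              prn))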
  26.     (defn calc-load-time [& children]
            (by-host-service
              (changed :state {:pairs? true}
                ;; splice the child streams into smap (its children are varargs)
                (apply smap
                       (fn [[previous current]]
                         (cond
                           ;; snippet loaded: metric = time from download to load
                           (and (= (:state previous) "downloaded")
                                (= (:state current) "loaded"))
                           (assoc previous :metric (- (:time current) (:time previous)))

                           ;; never loaded (index TTL expired): report the timeout instead
                           (and (= (:state previous) "downloaded")
                                (= (:state current) "expired"))
                           (assoc previous :metric (* JS_TIMEOUT 1000))))
                       children))))
  27.     (defn aggregate-by-browser [& children]
            (by [:browser]
              (fixed-time-window 60
                (sdo
                  ;; one event per 60s window: the median load time...
                  (smap folds/median (apply tag "median-load-time" children))
                  ;; ...and how many loads the window contained
                  (smap folds/count (apply tag "load-count" children))))))
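      Hypothetical wiring for the two functions above (the tag name and the final consumers are assumptions, not from the slides):

          (streams
            (where (tagged "jsSnippet")        ; assumed tag on snippet events
              (calc-load-time
                (aggregate-by-browser index prn))))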
  28. #3 risk: wrong decision (approve/decline)
      Alerting: anomaly detection
  29. Motivation
      - control false alarms mathematically
      - a threshold per customer
      - threshold seasonality
  30. Alert me if the probability that we decline more than k out of n transactions, given decline probability p, is below 1 in a million (t = 0.0001%). See the sketch below.
      - n: number of TXs (last 30 minutes)
      - k: number of declined TXs (last 30 minutes)
      - p: per-customer declined/total ratio (last 24 hours)
      - t: alert threshold
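      A sketch of that check in Clojure (helper names are hypothetical; assumes 0 < p < 1 and works in log space to avoid overflow for large n):

          (defn log-binomial-pmf
            "log P(X = k) for X ~ Binomial(n, p)"
            [n k p]
            (let [log-choose (reduce + (map #(Math/log (/ (- (inc n) %) %))
                                            (range 1 (inc k))))]
              (+ log-choose
                 (* k (Math/log p))
                 (* (- n k) (Math/log (- 1.0 p))))))

          (defn decline-anomaly?
            "True when seeing k or more declines out of n is less likely than t
            under the customer's 24h baseline decline rate p."
            [n k p t]
            (let [tail (reduce + (map #(Math/exp (log-binomial-pmf n % p))
                                      (range k (inc n))))]
              (< tail t)))

          (decline-anomaly? 30 12 0.05 1e-6)   ; 12 declines in 30 TXs at a 5% baseline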
  31. Binomial distribution assumption: external events are uncorrelated.
      What happens when a customer retries the same TX because the first one was declined?
  32. Questions?
      email itai@forter.com
      http://tech.forter.com
      http://www.softwarearchitectureaddict.com
