
StatsCraft 2015: Monitoring using riemann - Moshe Zada

Slides of Moshe Zada's talk at StatsCraft 2015 event



  1. (Monitoring (and (alerting (with riemann)))) - Moshe Zada @ Forter
  2. Riemann - an event stream processor. Think pipes.
  3. In the pipeline - Intro: about Forter, low latency
  4. In the pipeline - Basic alerts: implement a simple state machine, throttled alerts, ignore spikes
  5. In the pipeline - Visualize: stream to ELK, event enrichment, showoff
  6. In the pipeline - Back to tests: maintenance mode, heartbeat alerts
  7. In the pipeline - Aggregation: sum/count/max a batch of events, monitor browser JavaScript
  8. Let's start
  9. riemann @ forter - Who am I: Moshe Zada, problem solver @ Forter. Responsible for the entire monitoring, CI and CD stack, among other things.
  10. riemann @ forter - And where do I work: Forter
  12. riemann @ forter: We can catch 80% of online thieves before they even get to checkout
  13. riemann @ forter: How does latency affect Forter?
  14. riemann @ forter - Tech: Forter's low-latency stack - Storm and Spark for transaction stream processing; Couchbase, Elasticsearch, Redis and MySQL as datastores; immutable images; ELK for visibility
  15. Riemann - Basic Concepts
  16. Basic Concepts - Who is behind Riemann? This dude: aphyr - Kyle Kingsbury, the one from "Call Me Maybe", works at Stripe
  19. Basic Concepts - Events: Events are just structs, and in Riemann they are treated as immutable maps.

        message Event {
          optional int64 time = 1;
          optional string state = 2;
          optional string service = 3;
          optional string host = 4;
          optional string description = 5;
          repeated string tags = 7;
          optional float ttl = 8;
          repeated Attribute attributes = 9;

          optional sint64 metric_sint64 = 13;
          optional double metric_d = 14;
          optional float metric_f = 15;
        }

        message Attribute {
          required string key = 1;
          optional string value = 2;
        }
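For context, a minimal sketch of pushing such an event into Riemann from Clojure, assuming the riemann-clojure-client library and a Riemann server listening on the default TCP port 5555 (the host and metric values below are taken from the sample event on the next slide):

    (require '[riemann.client :as r])

    ; Open a TCP connection to the Riemann server (assumed to be local).
    (def client (r/tcp-client {:host "127.0.0.1" :port 5555}))

    ; Send one event; deref to wait for the server's acknowledgement.
    @(r/send-event client {:service "prod-redis-n01 Free memory"
                           :host    "10.0.0.1"
                           :metric  1024
                           :ttl     60
                           :tags    ["collectd" "redis" "infra"]})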
  20. Basic Concepts - Events - Examples: a sample Collectd event

        {
          "service": "prod-redis-n01 Free memory",
          "host": "10.0.0.1",
          "description": "total memory free in bytes",
          "state": nil,
          "ttl": 60,
          "metric": 1024,
          "tags": ["collectd", "redis", "infra"]
        }
  23. Basic Concepts - The index: The index is a table of the current state of all services tracked by Riemann.

        key                     event
        10.0.0.1-redis-free     { .. "metric": "5", "service": "redis-free" .. }
        10.0.0.2-cache-miss     { .. "metric": "6", "service": "cache-miss" .. }
        10.0.0.2-cache-hit      { .. "metric": "6", "service": "cache-hit" .. }
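As an aside (not on the slides), the index can also be queried over the same client connection using Riemann's query language; a small sketch, reusing the client from the earlier example (the "%redis%" pattern is illustrative):

    ; Ask the index for all currently-indexed redis-related events.
    @(r/query client "service =~ \"%redis%\"")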
  24. Basic Concepts - The index - TTL: Events entered into the index have a :ttl field which indicates how long that event is valid for.

        {"service": "foobar", "ttl": 60, "state": "pass"} -> "index"

      After 60 secs:

        {"service": "foobar", "ttl": 60, "state": "expired"} -> "index"
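A minimal riemann.config sketch of how this is wired up, using Riemann's stock streams (the 10-second scan interval and 60-second default TTL are arbitrary choices, not from the slides):

    ; Scan the index every 10 seconds and re-inject expired events
    ; into the streams with :state "expired".
    (periodically-expire 10)

    (let [index (index)]
      (streams
        ; Give events without a TTL a 60-second default, then index them.
        (default :ttl 60
          index)))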
  26. merchantSanity - Implement a simple state machine
  27. Probes and tests - Simple test: merchantSanity. Riemann forwards to PagerDuty only those events whose state has changed.

        {
          "service": "prod-gateway-n01 MerchantSanity system test",
          "host": "10.0.0.2",
          "description": "Check forters merchants api",
          "state": "failure",
          "ttl": 60,
          "metric": 0,
          "tags": ["test", "merchantSanity"]
        }
  28. Probes and tests - Simple test - Flow: "probe machine" --> "riemann" --> "pagerduty"
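The probe-machine side is not shown on the slides; a hedged sketch of what such a probe could look like in Clojure, reusing the client from the earlier example (the check function is hypothetical, the service name and states come from the slides):

    (defn run-merchant-sanity-probe
      "Run one check and report its outcome to Riemann."
      [client check-fn]
      (let [ok? (try (check-fn) (catch Exception _ false))]
        @(r/send-event client
           {:service "prod-gateway-n01 MerchantSanity system test"
            :state   (if ok? "ok" "failure")
            :metric  (if ok? 1 0)
            :ttl     60
            :tags    ["test" "merchantSanity"]})))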
  29. Probes and tests - Simple test - The code behind:

        (tagged "merchantSanity"
          (pagerduty-test-dispatch "asdasdad"))

        (defn pagerduty-test-dispatch
          "Constructs a pagerduty stream which resolves and
          triggers alerts based on test failure"
          [key]
          (let [pd (pagerduty key)]
            (changed-state
              (where (state "ok") (:resolve pd))
              (where (state "failure") (:trigger pd)))))
  32. When things break, they submit a *ton* of events. How can I throttle them?
  33. Probes and tests - Test dispatch, throttled: Sometimes, when things break, they submit a ton of events.

        ; If the state changed
        (changed-state {:init "passed"}
          ; and the new state is passed - resolve
          (where (state "passed") (:resolve pd)))
        ; If the state of the event is failed
        (where (state "failed")
          ; group by host and service fields and
          ; pass only one event per 60 seconds
          (by [:host :service]
            (throttle 1 60 (:trigger pd))))
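Pieced together into a complete stream function (a hedged sketch, not the exact code from the slides; the name pagerduty-throttled-dispatch and the use of sdo to fan out to both branches are my own):

    (defn pagerduty-throttled-dispatch
      "Resolve on recovery, but trigger at most once per minute
      per host/service while a test keeps failing."
      [key]
      (let [pd (pagerduty key)]
        (sdo
          ; Resolve as soon as the state flips back to passed.
          (changed-state {:init "passed"}
            (where (state "passed") (:resolve pd)))
          ; Trigger on failures, at most one event per 60s per host+service.
          (where (state "failed")
            (by [:host :service]
              (throttle 1 60 (:trigger pd)))))))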
  35. How can I ignore spikes (statistical alerts)?
  36. Probes and tests - CPU spikes: Monitoring infra - ignore spikes. Collectd gathers our instances' CPU info; if >30% failed - trigger.

        (defn pagerduty-probe-dispatch
          [key]
          ...
          (fixed-time-window 120
            ...
            (assoc (first events)
                   {:metric fraction
                    :state (condp < fraction
                             0.3  "failed"
                             0.05 "warning"
                             "passed")})
            (pagerduty-test-dispatch key)))
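Since the slide elides most of the body, here is one hedged way to fill it in (a sketch under my own assumptions, not Forter's actual code: the failure predicate on :state and the two-argument signature, which matches the call on the usage slide, are assumptions):

    (defn pagerduty-probe-dispatch
      "Collect events into fixed time windows and alert when more than 30%
      of a window's events are in a failed state."
      [key window-secs]
      (fixed-time-window window-secs
        (smap (fn [events]
                (when (seq events)
                  (let [failed   (count (filter #(= "failed" (:state %)) events))
                        fraction (/ failed (count events))]
                    ; Reuse the first event of the window as a carrier for the
                    ; aggregated metric and derived state.
                    (assoc (first events)
                           :metric fraction
                           :state (condp < fraction
                                    0.3  "failed"
                                    0.05 "warning"
                                    "passed")))))
              (pagerduty-test-dispatch key))))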
  40. Probes and tests - Usage:

        (tagged "merchantSanity"
          (pagerduty-test-dispatch "3adab5c52e1511e5a"))

        (tagged-all ["collectd", "cpu"]
          (pagerduty-probe-dispatch "4a6b58212e1511e5b" 120))
  41. Visualize
  42. Visualize - Stream to ELK:

        (where (and (not (tagged-any ["kibanaIgnore"]))
                    (not (state "expired")))
          (logstash {:host "127.0.0.1"
                     :pool-size 20
                     :claim-timeout 0.2}))
  44. Where can I find my events? *prod*? *nimbus*? *merchantSanity*?
  45. Visualize - Prepare for ELK: Where can I find my events? branch: prod, role: nimbus, deploytime: 2015-07-19T1918

        {
          "service": "prod-nimbus-instance-2015-07-19T1918 df-mnt/percent",
          "host": "ip-10-139-118-128",
          "metric": 100,
          "tags": ["collectd"],
          "time": "2015-07-19T16:45:58.000Z",
          "ttl": 240,
          "plugin": "df"
        }

      So let's split the service field!
  47. Visualize - Prepare for ELK - Usage:

        (where (and (not (tagged-any ["kibanaIgnore"]))
                    (not (state "expired")))
          (enrich
            (logstash {:host "127.0.0.1"
                       :pool-size 20
                       :claim-timeout 0.2})))
  48. Visualize - Prepare for ELK - Enrich:

        (defn enrich
          "Parse environment settings from service name prefix"
          [& children]
          (apply smap
                 (fn stream [event]
                   (let [regex #"^(.*?-feature|prod)-([\w-]+)-instance-(\w+-\w+-\w+).(.*)"
                         [all branch role deploytime subservice] (re-find regex (:service event))
                         is-test (not (nil? (re-find #"^(1234|5678)" (str (:sessionId event)))))]
                     (assoc event
                            :env (str branch "-" deploytime)
                            :branch branch
                            :deploytime deploytime
                            :role role
                            :subservice subservice
                            :test is-test)))
                 children))
  50. Visualize - Prepare for ELK - Enrich (result):

        {
          "service": "prod-nimbus-instance-2015-07-19T1918/df-mnt/percent_bytes-free",
          "env": "prod-2015-07-19T1918",
          "branch": "prod",
          "deploytime": "2015-07-19T1918",
          "role": "nimbus",
          "subservice": "df-mnt/percent_bytes-free",
          "host": "ip-10-139-118-128",
          "metric": 100
        }
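A quick way to sanity-check the enrich stream in a REPL (a hedged sketch; prn stands in for a real downstream such as the logstash stream):

    ; Riemann streams are just functions of events, so prn works as a child.
    ((enrich prn)
     {:service "prod-nimbus-instance-2015-07-19T1918/df-mnt/percent_bytes-free"
      :host    "ip-10-139-118-128"
      :metric  100})
    ; => prints the event with :env, :branch, :role, :deploytime,
    ;    :subservice and :test added, as on the slide above.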
  51. Showoff
  52. Visualize - Result: Storm topology with timing
  53. Visualize - Result: GitHub integration
  55. Visualize - Result: Latency grouped by deploytime
  56. Visualize - Result: Exception histogram by subservice
  57. Visualize - Result: Collectd CPU usage by CPU id
  58. BTW, it's all open source - http://github.com/forter
  59. Ignore irrelevant old prod alerts / Maintenance
  60. Back to tests - Maintenance mode: Send a "maintenance-mode" event; Riemann queries its own index for the "maintenance-mode" event and, if it exists, ignores the alert. Enable:

        {
          "service": "prod-2015-07-19T1918 maintenance-mode",
          "ttl": 120,
          "state": "active"
        }

      And usage:

        (where (and (state "failed")
                    (not (maintenance-mode (str (:env event) " maintenance-mode"))))
          (:trigger pd))
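The maintenance-mode helper itself is not shown on the slides; a hedged sketch of one way it could be implemented inside riemann.config, assuming the index is reachable via (:index @core), the lookup is done with riemann.index/lookup, and the maintenance event is submitted without a host:

    (defn maintenance-mode
      "True when an active maintenance-mode event for this service
      is still alive in the index (it drops out on its own via :ttl)."
      [service]
      (let [e (riemann.index/lookup (:index @core) nil service)]
        (and e (= "active" (:state e)))))

Because the maintenance event carries a TTL, maintenance mode switches itself off once the event expires out of the index.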
  62. How can I check heartbeats?
  63. Back to tests - Heartbeat alerts:

        (defn pagerduty-cron-expiration
          "Constructs a pagerduty stream which resolves and
          triggers alerts based on event expiration"
          [key]
          (let [pd (custom-pagerduty key)]
            (where (expired? event)
              (with {:state "failed"
                     :description "TTL Expired. Check that the cron service"}
                (pagerduty-test-dispatch key))
              (else
                (pagerduty-test-dispatch key)))))
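A hedged sketch of how such a heartbeat fits together (the tag, TTL and service key below are made up): the cron job sends an event with a TTL a bit longer than its schedule; if it stops reporting, the index expires the event, the expired event flows back through the streams, and the stream above triggers PagerDuty.

    ; riemann.config sketch
    ; Keep :tags on expired events so the (tagged ...) routing below
    ; still matches them.
    (periodically-expire 10 {:keep-keys [:host :service :tags]})

    (let [index (index)]
      (streams
        index
        (tagged "nightly-backup"                             ; hypothetical tag
          (pagerduty-cron-expiration "cron-service-key")))) ; hypothetical key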
