6. In the pipeline
Intro
Basic alerts
Visualize
Back to tests
- Maintenance mode
- Heartbeat alerts
7. In the pipeline
Intro
Basic alerts
Visualize
Back to tests
Aggregation
- Sum / count / max over a batch of events
- Monitor browser JavaScript
14. riemann@forter
Who am I, and where do I work
Tech
Forter's low-latency stack
- Storm and Spark for transaction stream processing
- Couchbase, Elasticsearch, Redis, and MySQL as datastores
- Immutable images
- ELK for visibility
18. Basic Concepts
Who is behind riemann?
This dude: aphyr (Kyle Kingsbury)
- The one from "Call Me Maybe"
- Works at Stripe
19. Basic Concepts
Events
Events are just structs; in Riemann they are treated as immutable maps.
message Event {
  optional int64 time = 1;
  optional string state = 2;
  optional string service = 3;
  optional string host = 4;
  optional string description = 5;
  repeated string tags = 7;
  optional float ttl = 8;
  repeated Attribute attributes = 9;
  optional sint64 metric_sint64 = 13;
  optional double metric_d = 14;
  optional float metric_f = 15;
}
message Attribute {
  required string key = 1;
  optional string value = 2;
}
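To make "immutable" concrete, here is a minimal Python sketch (not Riemann code). The `Event` class is a hypothetical stand-in that mirrors only some of the protobuf fields above; the point is that changing a field produces a new event and leaves the original untouched.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Event:
    """Illustrative stand-in for a Riemann event (subset of the fields above)."""
    host: Optional[str] = None
    service: Optional[str] = None
    state: Optional[str] = None
    description: Optional[str] = None
    metric: Optional[float] = None
    ttl: Optional[float] = None
    time: Optional[int] = None
    tags: Tuple[str, ...] = ()


original = Event(host="10.0.0.2", service="cache-miss", state="ok", metric=6.0, ttl=60.0)

# "Changing" a field builds a new event; the original stays as it was.
updated = replace(original, state="failure")
```

Assigning to `original.state` directly raises `FrozenInstanceError`, which is the behaviour the sketch is meant to show.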
23. Basic Concepts
The index
The index is a table of the current state of all services
tracked by Riemann.
key                   event
10.0.0.1-redis-free   { .."metric":"5", "service":"redis-free".. }
10.0.0.2-cache-miss   { .."metric":"6", "service":"cache-miss".. }
10.0.0.2-cache-hit    { .."metric":"6", "service":"cache-hit".. }
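The table above can be modelled as a plain dictionary. This is a rough Python sketch, not Riemann's implementation; `index_key` and `update_index` are hypothetical names, and the "host-service" key format simply copies the table.

```python
def index_key(event):
    """Build the 'host-service' key used in the table above."""
    return f'{event["host"]}-{event["service"]}'


def update_index(index, event):
    """Store the event, replacing any earlier event with the same key."""
    index[index_key(event)] = event


index = {}
update_index(index, {"host": "10.0.0.2", "service": "cache-miss", "metric": 4})
update_index(index, {"host": "10.0.0.2", "service": "cache-hit", "metric": 6})
# A newer event for the same host and service overwrites the old one.
update_index(index, {"host": "10.0.0.2", "service": "cache-miss", "metric": 6})
```

The index therefore holds one entry per host and service: the most recent event.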
25. Basic Concepts
TTL
Events entered into the index have a :ttl field, which indicates how long (in seconds) the event is valid.
{"service": "foobar", "ttl": 60, "state": "pass"} -> "index"
After 60 seconds without a newer event, it expires:
{"service": "foobar", "ttl": 60, "state": "expired"} -> "index"
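A simplified Python model of TTL expiry, assuming each indexed event carries a "time" (seconds) alongside its "ttl". The `expire` function is a hypothetical name, and Riemann's own expiry logic may differ in detail; the sketch only shows the idea that a stale entry leaves the index and an "expired" event comes out.

```python
def expire(index, now):
    """Drop entries older than their TTL and return matching 'expired' events."""
    expired = []
    for key, event in list(index.items()):
        if now - event["time"] > event["ttl"]:
            del index[key]
            expired.append({**event, "state": "expired", "time": now})
    return expired


index = {
    ("10.0.0.2", "foobar"): {
        "host": "10.0.0.2", "service": "foobar",
        "state": "pass", "ttl": 60, "time": 0,
    }
}

still_valid = expire(index, now=30)   # within the TTL: nothing happens
timed_out = expire(index, now=61)     # past the TTL: entry removed, event emitted
```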
27. Probes and tests
Simple test
merchantSanity
Riemann forwards an event to PagerDuty only when its state has changed.
{
  "service": "prod-gateway-n01 MerchantSanity system test",
  "host": "10.0.0.2",
  "description": "Check Forter's merchants API",
  "state": "failure",
  "ttl": 60,
  "metric": 0,
  "tags": ["test", "merchantSanity"]
}
29. Probes and tests
Simple test
Flow
"probe machine" --> "riemann" --> "pagerduty"
The code behind
(defn pagerduty-test-dispatch
  "Constructs a PagerDuty stream which resolves and
  triggers alerts based on test failure."
  [key]
  ; key is the PagerDuty service key
  (let [pd (pagerduty key)]
    ; only react when the state changes
    (changed-state
      (where (state "ok")
        (:resolve pd))
      (where (state "failure")
        (:trigger pd)))))

(tagged "merchantSanity"
  (pagerduty-test-dispatch "asdasdad"))
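What "only events whose state changed" means can be sketched in plain Python. This is a model of the behaviour, not Riemann code: `changed_state` and `dispatch` are hypothetical names, and the strings "resolve" and "trigger" stand in for the PagerDuty calls.

```python
def changed_state(events, init=None):
    """Yield an event only when its state differs from the previous
    state seen for the same (host, service) pair."""
    last_state = {}
    for event in events:
        key = (event.get("host"), event.get("service"))
        previous = last_state.get(key, init)
        last_state[key] = event["state"]
        if event["state"] != previous:
            yield event


def dispatch(events):
    """Map state changes to the action that would be sent to PagerDuty."""
    actions = []
    for event in changed_state(events):
        if event["state"] == "ok":
            actions.append("resolve")
        elif event["state"] == "failure":
            actions.append("trigger")
    return actions


states = ["ok", "ok", "failure", "failure", "failure", "ok"]
events = [{"host": "10.0.0.2", "service": "MerchantSanity", "state": s} for s in states]
actions = dispatch(events)
```

Six events come in, but only three actions go out: the repeated "ok" and "failure" events are dropped because nothing changed.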
32. When things break, they submit a *ton* of events.
How can I throttle them?
33. Probes and tests
Test dispatch - throttled
Throttle alerts
Sometimes, when things break, they submit a ton of events.
(defn pagerduty-test-dispatch
  [key]
  (let [pd (pagerduty key)]
    ; sdo sends each event to both child streams
    (sdo
      ; If the state changed and is now "passed": resolve
      (changed-state {:init "passed"}
        (where (state "passed") (:resolve pd)))
      ; If the state of the event is "failed":
      (where (state "failed")
        ; group by host and service fields,
        ; pass only one event per 60 seconds
        (by [:host :service]
          (throttle 1 60 (:trigger pd)))))))
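The effect of grouping by host and service and then throttling to one event per 60 seconds can be modelled in Python. This is a simplified fixed-window sketch with hypothetical names (`Throttle`, `throttle_by`); Riemann's exact window timing may differ.

```python
class Throttle:
    """Allow at most n events per dt-second window (simplified model)."""

    def __init__(self, n, dt):
        self.n = n
        self.dt = dt
        self.window_start = None
        self.count = 0

    def allow(self, t):
        # Start a new window on the first event or once dt seconds have passed.
        if self.window_start is None or t - self.window_start >= self.dt:
            self.window_start = t
            self.count = 0
        self.count += 1
        return self.count <= self.n


def throttle_by(events, n, dt, fields=("host", "service")):
    """Keep one Throttle per distinct combination of `fields`."""
    throttles = {}
    passed = []
    for event in events:
        key = tuple(event.get(f) for f in fields)
        if key not in throttles:
            throttles[key] = Throttle(n, dt)
        if throttles[key].allow(event["time"]):
            passed.append(event)
    return passed


events = [
    {"host": "a", "service": "test", "time": 0},
    {"host": "b", "service": "test", "time": 5},
    {"host": "a", "service": "test", "time": 10},
    {"host": "a", "service": "test", "time": 59},
    {"host": "a", "service": "test", "time": 60},
    {"host": "a", "service": "test", "time": 61},
]
alerts = throttle_by(events, n=1, dt=60)
```

Host "a" gets through at t=0 and again at t=60; host "b" has its own budget, so its event at t=5 also passes.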
35. How can I ignore spikes (statistical alert)?
37. Probes and tests
CPU spikes
Monitoring infra - ignore spikes
collectd gathers CPU info from our instances.
If more than 30% of the events in a window failed, trigger.
(defn pagerduty-probe-dispatch
  [key]
  ...
  (fixed-time-window 120
    ...
      ; merge the computed fields into the first event of the window
      (merge (first events)
             {:metric fraction
              ; (condp < fraction 0.3 ...) tests (< 0.3 fraction)
              :state (condp < fraction
                       0.3 "failed"
                       0.05 "warning"
                       "passed")})
    (pagerduty-test-dispatch key)))
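A Python sketch of the windowing idea. How `fraction` is computed is elided on the slide, so this assumes it is the share of events in the window whose state is "failed"; `fixed_time_windows` and `summarize` are hypothetical names, and the thresholds copy the ones above.

```python
def fixed_time_windows(events, width):
    """Group events into consecutive windows of `width` seconds."""
    windows = {}
    for event in events:
        windows.setdefault(event["time"] // width, []).append(event)
    return [windows[k] for k in sorted(windows)]


def summarize(window):
    """Collapse one window into a single event carrying the failed fraction.

    Assumption: fraction = failed events / all events in the window.
    """
    failed = sum(1 for e in window if e["state"] == "failed")
    fraction = failed / len(window)
    if fraction > 0.3:
        state = "failed"
    elif fraction > 0.05:
        state = "warning"
    else:
        state = "passed"
    return {**window[0], "metric": fraction, "state": state}


# One failed sample out of 40 is a spike; four out of ten is an outage.
spike = [{"time": t, "state": "failed" if t == 7 else "passed"} for t in range(40)]
outage = [{"time": 120 + t, "state": "failed" if t < 4 else "passed"} for t in range(10)]
summaries = [summarize(w) for w in fixed_time_windows(spike + outage, 120)]
```

The single spike is absorbed into a "passed" summary, while the second window crosses the 30% threshold and comes out as "failed".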
40. Probes and tests
Usage
(tagged "merchantSanity"
  (pagerduty-test-dispatch "3adab5c52e1511e5a"))
(tagged-all ["collectd", "cpu"]
  (pagerduty-probe-dispatch "4a6b58212e1511e5b" 120))
44. Where can I find my events?
*prod* ?
*nimbus* ?
*merchantSanity* ?
46. Visualize
- Stream to ELK
- Prepare for ELK
Where can I find my events?
branch: prod
role: nimbus
deploytime: 2015-07-19T1918
{
  "service": "prod-nimbus-instance-2015-07-19T1918 df-mnt/percent",
  "host": "ip-10-139-118-128",
  "metric": 100,
  "tags": ["collectd"],
  "time": "2015-07-19T16:45:58.000Z",
  "ttl": 240,
  "plugin": "df"
}
So let's split the service field!
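One way the split could look, sketched in Python. The service format "<branch>-<role>-instance-<deploytime> <metric name>" is an assumption inferred from the single sample event above, and `SERVICE_RE` and `split_service` are hypothetical names, not part of the talk's config.

```python
import re

# Assumed layout: "<branch>-<role>-instance-<deploytime> <metric name>"
SERVICE_RE = re.compile(
    r"^(?P<branch>[^-]+)-(?P<role>[^-]+)-instance-(?P<deploytime>\S+) (?P<service>.+)$"
)


def split_service(event):
    """Return a copy of the event with branch, role and deploytime as
    separate fields; events that do not match are returned unchanged."""
    match = SERVICE_RE.match(event.get("service", ""))
    if match is None:
        return dict(event)
    # The short metric name replaces the original combined service string.
    return {**event, **match.groupdict()}


event = {
    "service": "prod-nimbus-instance-2015-07-19T1918 df-mnt/percent",
    "host": "ip-10-139-118-128",
    "metric": 100,
}
enriched = split_service(event)
```

With separate fields, a search for branch "prod" or role "nimbus" in ELK no longer depends on wildcard matching against the service string.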
61. Back to tests
- Maintenance
Maintenance Mode
Send a "maintenance-mode" event.
Riemann queries its own index for the "maintenance-mode" event; if it exists, alerts are ignored.
Enable:
{ "service": "prod-2015-07-19T1918 maintenance-mode",
  "ttl": 120,
  "state": "active" }
And usage:
(where (and (state "failed")
            (not (maintenance-mode (str (:env event) " maintenance-mode"))))
  (:trigger pd))
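The body of the maintenance-mode helper is not shown in the talk, so the following Python model is only a guess at its behaviour: it assumes the helper looks in the index for an unexpired event with the given service name and state "active". The names `maintenance_mode` and `should_trigger` are hypothetical.

```python
def maintenance_mode(index, service, now):
    """True if the index holds an unexpired 'active' event for `service`."""
    for event in index.values():
        if (
            event["service"] == service
            and event["state"] == "active"
            and now - event["time"] <= event["ttl"]
        ):
            return True
    return False


def should_trigger(index, event, now):
    """Alert on failures unless the event's environment is in maintenance."""
    flag_service = f'{event["env"]} maintenance-mode'
    return event["state"] == "failed" and not maintenance_mode(index, flag_service, now)


index = {
    "maintenance": {
        "service": "prod-2015-07-19T1918 maintenance-mode",
        "state": "active", "ttl": 120, "time": 0,
    }
}
failure = {"env": "prod-2015-07-19T1918", "state": "failed"}

during_maintenance = should_trigger(index, failure, now=60)   # flag still valid
after_maintenance = should_trigger(index, failure, now=200)   # flag past its TTL
```

Because the flag is an ordinary event with a TTL, maintenance mode switches itself off if nobody keeps re-sending it.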