(Monitoring (and
(alerting (with
riemann))))
Moshe Zada@Forter
Riemann - event stream processor
think pipes
In the pipeline
Intro
- About Forter
- Low latency
Basic alerts
- Implement simple state machine
- Throttled alert
- Ignore spikes
Visualize
- Stream to ELK
- Event enrichment
- Showoff
Back to tests
- Maintenance mode
- Heartbeat alerts
Aggregation
- Sum / count / max over a batch of events
- Monitor browser JavaScript
Let's start
riemann@forter
Who am I
Moshe Zada
Problem solver@Forter
Responsible for the entire monitoring, CI and CD stack, among other things
And where do I work
Forter
We can catch 80% of online thieves before they even get
to checkout
How does latency affect Forter?
Tech
Forter's low-latency stack
Using Storm and Spark for transaction stream processing
Couchbase, Elasticsearch, Redis, MySQL as datastores
Immutable images
Using ELK for visibility
Riemann - Basic Concepts
Who is behind Riemann?
This dude
aphyr - Kyle Kingsbury
The one from "Call Me Maybe"
Works at Stripe
Events
Events are just structs; in Riemann they are treated as immutable maps.
message Event {
  optional int64 time = 1;
  optional string state = 2;
  optional string service = 3;
  optional string host = 4;
  optional string description = 5;
  repeated string tags = 7;
  optional float ttl = 8;
  repeated Attribute attributes = 9;
  optional sint64 metric_sint64 = 13;
  optional double metric_d = 14;
  optional float metric_f = 15;
}
message Attribute {
  required string key = 1;
  optional string value = 2;
}
Sample event
Collectd event
{
  "service": "prod-redis-n01 Free memory",
  "host": "10.0.0.1",
  "description": "total memory free in bytes",
  "state": null,
  "ttl": 60,
  "metric": 1024,
  "tags": ["collectd", "redis", "infra"]
}
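The same event can be written as a plain Clojure map, which makes the "immutable map" point concrete. This is a minimal sketch that needs only clojure.core, not Riemann; the `with-state` helper and the 2048 threshold are hypothetical names and values introduced for illustration.

```clojure
;; The collectd event from the slide as a plain Clojure map.
(def event
  {:service     "prod-redis-n01 Free memory"
   :host        "10.0.0.1"
   :description "total memory free in bytes"
   :state       nil
   :ttl         60
   :metric      1024
   :tags        ["collectd" "redis" "infra"]})

;; Hypothetical helper: returns a NEW event with :state filled in.
;; The threshold is an illustrative assumption, not a Forter setting.
(defn with-state [e threshold]
  (assoc e :state (if (< (:metric e) threshold) "warning" "ok")))

(def updated (with-state event 2048))

(println (:state updated)) ; the new map carries the derived state
(println (:state event))   ; the original map is unchanged
```

Because `assoc` returns a new map, every stream that received the original event still sees it unmodified.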
The index
The index is a table of the current state of all services
tracked by Riemann.
key                  event
10.0.0.1-redis-free  { .."metric":"5", "service":"redis-free".. }
10.0.0.2-cache-miss  { .."metric":"6", "service":"cache-miss".. }
10.0.0.2-cache-hit   { .."metric":"6", "service":"cache-hit".. }
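One way to picture the index is as a map from a host and service pair to the most recent event for that pair. The sketch below is a toy model in plain Clojure, not Riemann's implementation; `index-event` is a hypothetical helper name.

```clojure
;; A toy index: a map from [host service] to the latest event.
(defn index-event [index event]
  (assoc index [(:host event) (:service event)] event))

(def events
  [{:host "10.0.0.1" :service "redis-free" :metric 4}
   {:host "10.0.0.2" :service "cache-miss" :metric 6}
   {:host "10.0.0.1" :service "redis-free" :metric 5}]) ; newer reading

;; Feeding the events in order leaves one entry per host/service pair.
(def index (reduce index-event {} events))

(println (count index))
(println (:metric (get index ["10.0.0.1" "redis-free"])))
```

The second redis-free reading replaces the first, so the table always answers "what is the current state of this service on this host".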
TTL
Events entered into the index have a :ttl field which indicates how long that event is valid for.
{"service": "foobar", "ttl": 60, "state": "pass"} -> "index"
After 60 seconds
{"service": "foobar", "ttl": 60, "state": "expired"} -> "index"
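The expiry rule can be modelled in a few lines of plain Clojure. This is only a sketch of the idea: Riemann performs expiry itself, and the `stale?` and `expire` helpers as well as the explicit `:time` field and `now` argument are assumptions made for the example.

```clojure
;; Hypothetical helper: an event is stale once now > time + ttl.
;; Times are in seconds.
(defn stale? [event now]
  (> now (+ (:time event) (:ttl event))))

;; Returns the event unchanged while it is fresh, or a copy marked
;; "expired" once its TTL has elapsed.
(defn expire [event now]
  (if (stale? event now)
    (assoc event :state "expired")
    event))

(def e {:service "foobar" :ttl 60 :state "pass" :time 1000})

(println (:state (expire e 1030))) ; 30 seconds after the event
(println (:state (expire e 1061))) ; 61 seconds after the event
```

A service that keeps reporting refreshes its entry before the TTL runs out, so only silent services end up expired.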
merchantSanity - Implement a simple state machine
Probes and tests
Simple test
merchantSanity
Riemann forwards to PagerDuty only those events whose state has changed
{
  "service": "prod-gateway-n01 MerchantSanity system test",
  "host": "10.0.0.2",
  "description": "Check Forter's merchants API",
  "state": "failure",
  "ttl": 60,
  "metric": 0,
  "tags": ["test", "merchantSanity"]
}
Flow
"probe machine" --> "riemann" --> "pagerduty"
The code behind
(defn pagerduty-test-dispatch
  "Constructs a pagerduty stream which resolves and
  triggers alerts based on test failure"
  [key]
  ; use the service key passed in by the caller
  (let [pd (pagerduty key)]
    ; forward only events whose state differs from the previous one
    (changed-state
      (where (state "ok")
        (:resolve pd))
      (where (state "failure")
        (:trigger pd)))))

(tagged "merchantSanity"
  (pagerduty-test-dispatch "asdasdad"))
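To see what "forward only on state change" does to a run of probe results, here is a toy model in plain Clojure. It is not Riemann's `changed-state` implementation, just a sketch of the behaviour under the assumption that state is tracked per host and service; `state-changes` and `probe-results` are hypothetical names.

```clojure
;; Toy model of "forward only on state change", per [host service].
;; Returns the events that would reach PagerDuty.
(defn state-changes [init events]
  (:out
    (reduce (fn [{:keys [seen out]} e]
              (let [k    [(:host e) (:service e)]
                    prev (get seen k init)]
                {:seen (assoc seen k (:state e))
                 :out  (if (= prev (:state e)) out (conj out e))}))
            {:seen {} :out []}
            events)))

;; Six probe runs: two passes, three failures, one recovery.
(def probe-results
  (map (fn [s] {:host "10.0.0.2" :service "MerchantSanity" :state s})
       ["ok" "ok" "failure" "failure" "failure" "ok"]))

(println (map :state (state-changes "ok" probe-results)))
```

Of the six events only two get through: the first failure (trigger) and the recovery (resolve).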
When things break, they submit a *ton* of events.
How can I throttle them?
Throttle alerts
Sometimes, when things break, they submit a ton of events.
; If the state changed
(changed-state {:init "passed"}
  ; and the new state is "passed" - resolve
  (where (state "passed") (:resolve pd)))
; If the state of the event is "failed"
(where (state "failed")
  ; group by the host and service fields and
  ; pass at most one event per 60 seconds for each group
  (by [:host :service]
    (throttle 1 60 (:trigger pd))))
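The effect of grouping by host and service and then throttling can be sketched in plain Clojure. This is a simplified model, not Riemann's `throttle`: it assumes windows aligned to multiples of `dt` seconds and events that carry a `:time` in seconds; `throttle-events` and `failures` are hypothetical names.

```clojure
;; Toy model of (by [:host :service] (throttle n dt ...)):
;; keep at most n events per key in each dt-second fixed window.
(defn throttle-events [n dt events]
  (:out
    (reduce (fn [{:keys [counts out]} e]
              (let [window (quot (:time e) dt)
                    k      [(:host e) (:service e) window]
                    c      (get counts k 0)]
                {:counts (assoc counts k (inc c))
                 :out    (if (< c n) (conj out e) out)}))
            {:counts {} :out []}
            events)))

;; A burst of failures from one probe, timestamps in seconds.
(def failures
  (for [t [0 5 10 59 60 61 130]]
    {:host "10.0.0.2" :service "MerchantSanity" :state "failed" :time t}))

(println (map :time (throttle-events 1 60 failures)))
```

Seven failures collapse to three pages, one per 60-second window, while a different host or service would get its own budget.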
How can I ignore spikes (statistical alert)?
CPU spikes
Monitoring Infra - ignore spikes
Collectd gathers our instances' CPU info
If >30% failed - Trigger
(defn pagerduty-probe-dispatch
  [key window]
  ...
  (fixed-time-window window
    ...
    ; merge (not assoc) because the new fields are given as a map
    (merge (first events)
           {:metric fraction
            :state (condp < fraction
                     0.3 "failed"
                     0.05 "warning"
                     "passed")})
    (pagerduty-test-dispatch key)))
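The elided parts aside, the classification step can be tried on its own in plain Clojure. This sketch assumes that `fraction` is the share of failed samples in one window; `failed-fraction`, `classify`, `spike` and `outage` are hypothetical names, and the sample sizes are made up.

```clojure
;; Fraction of failed samples in one window of events.
(defn failed-fraction [events]
  (if (empty? events)
    0.0
    (/ (count (filter #(= "failed" (:state %)) events))
       (double (count events)))))

;; (condp < x a ...) tests (< a x), so each clause reads "x greater than a".
(defn classify [fraction]
  (condp < fraction
    0.3  "failed"
    0.05 "warning"
    "passed"))

;; One bad sample out of 40 versus 16 bad samples out of 40.
(def spike  (concat (repeat 1 {:state "failed"}) (repeat 39 {:state "ok"})))
(def outage (concat (repeat 16 {:state "failed"}) (repeat 24 {:state "ok"})))

(println (classify (failed-fraction spike)))
(println (classify (failed-fraction outage)))
```

A single spike stays below both thresholds and never pages anyone, while a sustained problem crosses the 30% line.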
Usage
(tagged "merchantSanity"
  (pagerduty-test-dispatch "3adab5c52e1511e5a"))
(tagged-all ["collectd", "cpu"]
  (pagerduty-probe-dispatch "4a6b58212e1511e5b" 120))
Visualize
Stream to ELK
(where
  (and
    (not (tagged-any ["kibanaIgnore"]))
    (not (state "expired")))
  (logstash {:host "127.0.0.1"
             :pool-size 20
             :claim-timeout 0.2}))
Where can I find my events?
*prod* ?
*nimbus* ?
*merchantSanity* ?
Prepare for ELK
branch : prod
role : nimbus
deploytime : 2015-07-19T1918
{
  "service": "prod-nimbus-instance-2015-07-19T1918 df-mnt/percent",
  "host": "ip-10-139-118-128",
  "metric": 100,
  "tags": ["collectd"],
  "time": "2015-07-19T16:45:58.000Z",
  "ttl": 240,
  "plugin": "df"
}
So let's split the service field!
Usage
(where
  (and
    (not (tagged-any ["kibanaIgnore"]))
    (not (state "expired")))
  (enrich
    (logstash {:host "127.0.0.1"
               :pool-size 20
               :claim-timeout 0.2})))
Enrich
(defn enrich
  "Parse environment settings from service name prefix"
  [& children]
  (apply smap
    (fn stream [event]
      (let [; the backslashes of \w were lost on the slide; restored here
            regex #"^(.*?-feature|prod)-([\w-]+)-instance-(\w+-\w+-\w+).(.*)"
            [all branch role deploytime subservice] (re-find regex (str (:service event)))
            ; str guards against events that carry no :sessionId
            is-test (some? (re-find #"^(1234|5678)" (str (:sessionId event))))]
        (assoc event :env (str branch "-" deploytime)
                     :branch branch
                     :deploytime deploytime
                     :role role
                     :subservice subservice
                     :test is-test)))
    children))
Enrich
{
  "service": "prod-nimbus-instance-2015-07-19T1918/df-mnt/percent_bytes-free",
  "env": "prod-2015-07-19T1918",
  "branch": "prod",
  "deploytime": "2015-07-19T1918",
  "role": "nimbus",
  "subservice": "df-mnt/percent_bytes-free",
  "host": "ip-10-139-118-128",
  "metric": 100
}
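The parsing step of enrich can be tried outside Riemann as a pure function. This is a sketch in plain Clojure, assuming the `w` classes in the slide's pattern stand for `\w`; `service-pattern` and `parse-service` are hypothetical names, and the test-session flag is left out.

```clojure
;; Pure-function version of the parsing step; no Riemann required.
(def service-pattern
  #"^(.*?-feature|prod)-([\w-]+)-instance-(\w+-\w+-\w+).(.*)")

;; Returns the extra fields for a service name, or nil when the
;; name does not follow the naming convention.
(defn parse-service [service]
  (when-let [[_ branch role deploytime subservice]
             (re-find service-pattern service)]
    {:env        (str branch "-" deploytime)
     :branch     branch
     :role       role
     :deploytime deploytime
     :subservice subservice}))

(def parsed
  (parse-service "prod-nimbus-instance-2015-07-19T1918/df-mnt/percent_bytes-free"))

(println (:env parsed))
(println (:subservice parsed))
(println (parse-service "no-match-here"))
```

Returning nil for names that do not match makes it easy to decide separately what to do with events that fall outside the convention.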
Showoff
Storm topology with timing
GitHub integration
Latency grouped by deploytime
Exception histogram by subservice
Collectd CPU usage by CPU id
BTW it's all open source - http://github.com/forter
Ignore irrelevant old prod alerts / Maintenance
Back to tests
Maintenance Mode
Send a "maintenance-mode" event
Riemann queries its own index for the "maintenance-mode" event; if it exists, the alert is ignored
Enable:
{ "service": "prod-2015-07-19T1918 maintenance-mode",
"ttl": 120,
"state": "active" }
And usage:
(where (and (state "failed")
            (not (maintenance-mode (str (:env event) " maintenance-mode"))))
  (:trigger pd))
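The lookup behind maintenance mode can be modelled with an ordinary map standing in for the index. This is a sketch in plain Clojure under stated assumptions: the toy index is keyed by service name only, and `maintenance-mode?` and `should-page?` are hypothetical stand-ins, not the deck's own `maintenance-mode` function.

```clojure
;; Toy index keyed by service name, standing in for Riemann's index.
(def index
  {"prod-2015-07-19T1918 maintenance-mode"
   {:service "prod-2015-07-19T1918 maintenance-mode" :ttl 120 :state "active"}})

;; Hypothetical stand-in for the deck's maintenance-mode check.
(defn maintenance-mode? [index env]
  (= "active" (:state (get index (str env " maintenance-mode")))))

;; Page only for failures in environments that are not under maintenance.
(defn should-page? [index event]
  (and (= "failed" (:state event))
       (not (maintenance-mode? index (:env event)))))

(println (should-page? index {:state "failed" :env "prod-2015-07-19T1918"}))
(println (should-page? index {:state "failed" :env "prod-2015-07-20T0900"}))
```

Since the maintenance event has a TTL of 120 seconds, it drops out of the index on its own unless it is re-sent, so a forgotten maintenance window cannot silence alerts forever.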
How can I check heartbeat?
Heartbeat alerts
(defn pagerduty-cron-expiration
  "Constructs a pagerduty stream which resolves
  and triggers alerts based on event expiration"
  [key]
  (let [pd (custom-pagerduty key)]
    (where (expired? event)
      (with {:state "failed"
             :description "TTL Expired. Check that the cron service"}
        (pagerduty-test-dispatch key))
      (else
        (pagerduty-test-dispatch key)))))

 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 
Internship report on mechanical engineering
Internship report on mechanical engineeringInternship report on mechanical engineering
Internship report on mechanical engineering
 
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.ppt
 

Monitoring Riemann Events

  • 2. Riemann: an event stream processor. Think pipes.
  • 3. In the pipeline: Intro (about Forter, low latency)
  • 4. In the pipeline: Basic alerts (implement a simple state machine, throttled alerts, ignore spikes)
  • 5. In the pipeline: Visualize (stream to ELK, event enrichment, showoff)
  • 6. In the pipeline: Back to tests (maintenance mode, heartbeat alerts)
  • 7. In the pipeline: Aggregation (sum/count/max over a batch of events, monitor browser JavaScript)
  • 9. Who am I: Moshe Zada, problem solver @ Forter. Responsible for the entire monitoring, CI and CD stack, among other things.
  • 10. And where do I work: Forter
  • 12. We can catch 80% of online thieves before they even get to checkout
  • 13. How does latency affect Forter?
  • 14. Tech: Forter's low-latency stack. Storm and Spark for transaction stream processing; Couchbase, Elasticsearch, Redis and MySQL as datastores; immutable images; ELK for visibility.
  • 15. Riemann - Basic Concepts
  • 18. Who is behind Riemann? This dude: aphyr (Kyle Kingsbury), the one from "Call Me Maybe". Works at Stripe.
  • 19. Events: events are just structs, and in Riemann they are treated as immutable maps.
      message Event {
        optional int64 time = 1;
        optional string state = 2;
        optional string service = 3;
        optional string host = 4;
        optional string description = 5;
        repeated string tags = 7;
        optional float ttl = 8;
        repeated Attribute attributes = 9;
        optional sint64 metric_sint64 = 13;
        optional double metric_d = 14;
        optional float metric_f = 15;
      }
      message Attribute {
        required string key = 1;
        optional string value = 2;
      }
  • 20. Sample event: a collectd event
      {
        "service": "prod-redis-n01 Free memory",
        "host": "10.0.0.1",
        "description": "total memory free in bytes",
        "state": nil,
        "ttl": 60,
        "metric": 1024,
        "tags": ["collectd", "redis", "infra"]
      }
  • 23. The index: a table of the current state of all services tracked by Riemann.
      key                   event
      10.0.0.1-redis-free   { .. "metric": "5", "service": "redis-free" .. }
      10.0.0.2-cache-miss   { .. "metric": "6", "service": "cache-miss" .. }
      10.0.0.2-cache-hit    { .. "metric": "6", "service": "cache-hit" .. }
  • 25. TTL: events entered into the index have a :ttl field, which indicates how long that event is valid for.
      {"service": "foobar", "ttl": 60, "state": "pass"} -> "index"
    After 60 seconds:
      {"service": "foobar", "ttl": 60, "state": "expired"} -> "index"
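The index and TTL slides can be sketched in plain Clojure, with no Riemann on the classpath. This is only a toy model under stated assumptions: the index is a map keyed by host and service, each event carries a numeric `:time` in seconds, and the names `index-key`, `insert` and `expire` are made up for this illustration. Riemann's real index is more involved.

```clojure
;; Toy model of an index with TTL expiry. All names here are hypothetical.
(defn index-key
  "Events are identified by their host and service."
  [event]
  [(:host event) (:service event)])

(defn insert
  "Store the latest event for its host/service pair."
  [index event]
  (assoc index (index-key event) event))

(defn expired?
  "True once more than :ttl seconds have passed since the event's :time."
  [event now]
  (> now (+ (:time event) (:ttl event))))

(defn expire
  "Remove stale events from the index. Returns [new-index expired-events],
  where every expired event carries :state \"expired\"."
  [index now]
  (let [stale (filter #(expired? % now) (vals index))]
    [(reduce dissoc index (map index-key stale))
     (mapv #(assoc % :state "expired") stale)]))

(def idx
  (insert {} {:host "10.0.0.1" :service "foobar"
              :ttl 60 :state "pass" :time 0}))

(expire idx 30) ; nothing is stale yet, the index is unchanged
(expire idx 61) ; "foobar" leaves the index and comes back as "expired"
```

The point of the model is that nobody has to send the "expired" event: it is derived from the absence of a fresher one.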
  • 26. merchantSanity - Implement a simple state machine
  • 27. Probes and tests, a simple test: merchantSanity. Riemann forwards to PagerDuty only events whose state has changed.
      {
        "service": "prod-gateway-n01 MerchantSanity system test",
        "host": "10.0.0.2",
        "description": "Check forters merchants api",
        "state": "failure",
        "ttl": 60,
        "metric": 0,
        "tags": ["test", "merchantSanity"]
      }
  • 29. Simple test, the flow: "probe machine" --> "riemann" --> "pagerduty". The code behind:
      (defn pagerduty-test-dispatch
        "Constructs a pagerduty stream which resolves and
        triggers alerts based on test failure"
        [key]
        (let [pd (pagerduty key)]
          (changed-state
            (where (state "ok") (:resolve pd))
            (where (state "failure") (:trigger pd)))))

      (tagged "merchantSanity"
        (pagerduty-test-dispatch "asdasdad"))
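To make the "only forward events whose state changed" idea concrete, here is a minimal plain-Clojure sketch of such a state machine. It is not Riemann's implementation; `changed-state-stream` is a hypothetical name, and the sketch assumes state is tracked per host and service, with `init` as the state assumed before the first event.

```clojure
;; Sketch of a "forward only on state change" stream. Hypothetical names;
;; a simple atom is enough for illustration but is not race-free.
(defn changed-state-stream
  "Return a function that calls `child` with an event only when its :state
  differs from the last state seen for the same host and service."
  [init child]
  (let [last-state (atom {})]
    (fn [event]
      (let [k    [(:host event) (:service event)]
            prev (get @last-state k init)]
        (swap! last-state assoc k (:state event))
        (when (not= prev (:state event))
          (child event))))))

(def forwarded (atom []))

(def stream
  (changed-state-stream "ok" #(swap! forwarded conj (:state %))))

(doseq [s ["ok" "failure" "failure" "ok"]]
  (stream {:host "10.0.0.2" :service "MerchantSanity" :state s}))

@forwarded ; => ["failure" "ok"]
```

Four events go in, two come out: one trigger when the test starts failing and one resolve when it recovers.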
  • 32. When things break, they submit a *ton* of events. How can I throttle them?
  • 33. Test dispatch, throttled. Sometimes, when things break, they submit a ton of events.
      ; If the state changed
      (changed-state {:init "passed"}
        ; and the new state is "passed", resolve
        (where (state "passed") (:resolve pd)))
      ; If the state of the event is "failed"
      (where (state "failed")
        ; group by the host and service fields and
        ; pass only one event per 60 seconds
        (by [:host :service]
          (throttle 1 60 (:trigger pd))))
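The combination of grouping and throttling can be modelled in a few lines of plain Clojure. This is a sketch under assumptions, not Riemann's code: `throttle-stream` and `by-stream` are hypothetical names, windows are measured with each event's own `:time` (in seconds), and a window starts at the first event that arrives after the previous window has ended.

```clojure
;; Sketch: at most n events per dt seconds, independently per group.
(defn throttle-stream
  "Pass at most `n` events to `child` per window of `dt` seconds; drop the rest."
  [n dt child]
  (let [state (atom {:start nil :seen 0})]
    (fn [event]
      (let [t                    (:time event)
            {:keys [start seen]} @state
            fresh?               (or (nil? start) (>= (- t start) dt))
            window-start         (if fresh? t start)
            seen-so-far          (if fresh? 0 seen)]
        (reset! state {:start window-start :seen (inc seen-so-far)})
        (when (< seen-so-far n)
          (child event))))))

(defn by-stream
  "Create one independent child stream per distinct value of (key-fn event)."
  [key-fn make-child]
  (let [children (atom {})]
    (fn [event]
      (let [k (key-fn event)]
        (swap! children
               (fn [m] (if (contains? m k) m (assoc m k (make-child)))))
        ((get @children k) event)))))

(def passed (atom []))

(def stream
  (by-stream (juxt :host :service)
             #(throttle-stream 1 60
                (fn [e] (swap! passed conj [(:host e) (:time e)])))))

(doseq [[host t] [["a" 0] ["a" 10] ["b" 20] ["b" 30] ["a" 60]]]
  (stream {:host host :service "test" :state "failed" :time t}))

@passed ; => [["a" 0] ["b" 20] ["a" 60]]
```

Because every host/service pair gets its own throttle, a noisy host cannot swallow the single alert of a quiet one.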
  • 35. How can I ignore spikes (statistical alert)?
  • 37. Monitoring infra, ignoring CPU spikes. Collectd gathers CPU info from our instances. If more than 30% failed, trigger.
      (defn pagerduty-probe-dispatch [key]
        ...
        (fixed-time-window 120
          ...
          (merge (first events)
                 {:metric fraction
                  :state (condp < fraction
                           0.3  "failed"
                           0.05 "warning"
                           "passed")})
          (pagerduty-test-dispatch key)))
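The slide elides how `fraction` is computed, so the following plain-Clojure sketch has to assume it: here `fraction` is taken to be the share of events in the window whose `:state` is "failed". The function names are hypothetical; only the thresholds (0.3 and 0.05) come from the slide.

```clojure
;; Sketch: collapse a window of events into one event with a derived state.
;; Assumption: `fraction` = failed events / all events in the window.
(defn failed-fraction
  "Fraction of events in a non-empty window whose :state is \"failed\"."
  [events]
  (/ (count (filter #(= "failed" (:state %)) events))
     (double (count events))))

(defn classify
  "Map a failure fraction to a state using the thresholds from the slide."
  [fraction]
  (condp < fraction
    0.3  "failed"
    0.05 "warning"
    "passed"))

(defn summarize-window
  "Return the first event of the window with the fraction as :metric
  and the derived :state."
  [events]
  (let [fraction (failed-fraction events)]
    (merge (first events)
           {:metric fraction :state (classify fraction)})))

(defn window
  "Build a test window with `failed` failed events out of `total`."
  [failed total]
  (concat (repeat failed {:state "failed"})
          (repeat (- total failed) {:state "passed"})))

(:state (summarize-window (window 1 10))) ; => "warning"
(:state (summarize-window (window 4 10))) ; => "failed"
```

A single bad sample in a 120-second window only moves the fraction a little, which is exactly why one spike no longer pages anyone.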
  • 40. Usage:
      (tagged "merchantSanity"
        (pagerduty-test-dispatch "3adab5c52e1511e5a"))
      (tagged-all ["collectd", "cpu"]
        (pagerduty-probe-dispatch "4a6b58212e1511e5b" 120))
  • 42. Visualize - Stream to ELK:
      (where (and (not (tagged-any ["kibanaIgnore"]))
                  (not (state "expired")))
        (logstash {:host "127.0.0.1"
                   :pool-size 20
                   :claim-timeout 0.2}))
  • 44. Where can I find my events? *prod*? *nimbus*? *merchantSanity*?
  • 46. Prepare for ELK: where can I find my events? branch: prod, role: nimbus, deploytime: 2015-07-19T1918
      {
        "service": "prod-nimbus-instance-2015-07-19T1918 df-mnt/percent",
        "host": "ip-10-139-118-128",
        "metric": 100,
        "tags": ["collectd"],
        "time": "2015-07-19T16:45:58.000Z",
        "ttl": 240,
        "plugin": "df"
      }
    So let's split the service field!
  • 47. Prepare for ELK, usage:
      (where (and (not (tagged-any ["kibanaIgnore"]))
                  (not (state "expired")))
        (enrich
          (logstash {:host "127.0.0.1"
                     :pool-size 20
                     :claim-timeout 0.2})))
  • 48. Prepare for ELK, enrich:
      (defn enrich
        "Parse environment settings from service name prefix"
        [& children]
        (apply smap
          (fn stream [event]
            (let [regex #"^(.*?-feature|prod)-([\w-]+)-instance-(\w+-\w+-\w+).(.*)"
                  [all branch role deploytime subservice] (re-find regex (:service event))
                  ; events without a :sessionId are never test events
                  is-test (boolean (some->> (:sessionId event)
                                            (re-find #"^(1234|5678)")))]
              (assoc event
                     :env (str branch "-" deploytime)
                     :branch branch
                     :deploytime deploytime
                     :role role
                     :subservice subservice
                     :test is-test)))
          children))
  • 50. Prepare for ELK, the enriched event:
      {
        "service": "prod-nimbus-instance-2015-07-19T1918/df-mnt/percent_bytes-free",
        "env": "prod-2015-07-19T1918",
        "branch": "prod",
        "deploytime": "2015-07-19T1918",
        "role": "nimbus",
        "subservice": "df-mnt/percent_bytes-free",
        "host": "ip-10-139-118-128",
        "metric": 100
      }
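The service-name split can be tried outside Riemann with plain Clojure. The sketch below assumes the regex on the slides lost its backslashes in transcription (`w` standing for `\w`); `service-pattern` and `parse-service` are hypothetical names. The single `.` between the deploy time and the subservice accepts either the space or the slash that appear in the two sample events.

```clojure
;; Sketch: split a service name into branch, role, deploy time and subservice.
;; Assumption: the slide's "w" characters were originally "\w".
(def service-pattern
  #"^(.*?-feature|prod)-([\w-]+)-instance-(\w+-\w+-\w+).(.*)")

(defn parse-service
  "Return the parts of a service name as a map, or nil if it does not match."
  [service]
  (when-let [[_ branch role deploytime subservice]
             (re-find service-pattern service)]
    {:env        (str branch "-" deploytime)
     :branch     branch
     :role       role
     :deploytime deploytime
     :subservice subservice}))

(parse-service "prod-nimbus-instance-2015-07-19T1918/df-mnt/percent_bytes-free")
;; => {:env "prod-2015-07-19T1918", :branch "prod", :role "nimbus",
;;     :deploytime "2015-07-19T1918", :subservice "df-mnt/percent_bytes-free"}

(parse-service "foobar") ; => nil
```

Returning nil for names that do not follow the convention makes it easy to pass such events through untouched instead of attaching empty fields.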
  • 52. Result: Storm topology with timing
  • 53. Result: GitHub integration
  • 55. Result: latency grouped by deploytime
  • 56. Result: exception histogram by subservice
  • 57. Result: collectd CPU usage by CPU id
  • 58. BTW, it's all open source: http://github.com/forter
  • 59. Ignore irrelevant old prod alerts / Maintenance
  • 61. Back to tests - Maintenance mode: send a "maintenance-mode" event; Riemann queries its own index for the "maintenance-mode" event and, if it exists, ignores the alert. Enable:
      {
        "service": "prod-2015-07-19T1918 maintenance-mode",
        "ttl": 120,
        "state": "active"
      }
    And usage:
      (where (and (state "failed")
                  (not (maintenance-mode (str (:env event) " maintenance-mode"))))
        (:trigger pd))
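The deck does not show how `maintenance-mode` is implemented, so the following plain-Clojure sketch is only a guess at the shape of the check. It assumes a simplified index (a map from service name to the latest event) and treats maintenance as on when an event with state "active" is present; `maintenance-mode?` and `should-trigger?` are hypothetical names.

```clojure
;; Sketch: suppress alerts while a maintenance-mode event sits in the index.
;; The index here is a plain map, a stand-in for Riemann's real index.
(defn maintenance-mode?
  "True if the index holds an active maintenance-mode event for `env`."
  [index env]
  (= "active" (:state (get index (str env " maintenance-mode")))))

(defn should-trigger?
  "Alert only on failed events whose environment is not under maintenance."
  [index event]
  (and (= "failed" (:state event))
       (not (maintenance-mode? index (:env event)))))

(def index
  {"prod-2015-07-19T1918 maintenance-mode" {:state "active" :ttl 120}})

(should-trigger? index {:state "failed" :env "prod-2015-07-19T1918"}) ; => false
(should-trigger? index {:state "failed" :env "prod-2015-07-20T0900"}) ; => true
```

Since the enabling event carries a TTL of 120 seconds, maintenance presumably ends on its own unless the event keeps being resent, which fits the goal of silencing old prod environments without leaving alerts off forever.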
  • 62. How can I check heartbeat?
  • 63. Heartbeat alerts:
      (defn pagerduty-cron-expiration
        "Constructs a pagerduty stream which resolves
        and triggers alerts based on event expiration"
        [key]
        (let [pd (custom-pagerduty key)]
          (where (expired? event)
            (with {:state "failed"
                   :description "TTL Expired. Check that the cron service"}
              (pagerduty-test-dispatch key))
            (else
              (pagerduty-test-dispatch key)))))