Monitoring patterns for mitigating technical risk

@Forter

#1 risk
Slow or bad (500) API responses
Auto-healing (because humans are slow): SLA, Failover, Degradation, Throttling
Alerting: Detect, Filter, Alert, Diagnostics

SLA
                    Performance       Data Loss     Business Logic
TX Processing       Low Latency       Nope          Best Effort
Stream Processing   High Throughput   Best Effort   Best Effort
Batch Processing    High Volume       Nope          Reconciliation

Automatic Failover
http fencing (Incapsula)
http load balancing (ELB)
instance restart (Scaling Group)
process restart (upstart)
exceptions bubble up and crash

Graceful Degradation
nginx (lua)
expressjs (nodejs)
storm (java)
Diagram axes: Stability, Code Changes
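
A minimal Python sketch of the degradation idea: when a dependency fails or runs out of time, answer with a degraded default instead of an error. The function name, the default response and the 0.2 second budget are all hypothetical; the talk does not say how each layer degrades.

```python
import concurrent.futures

# Hypothetical fallback returned when the real decision is unavailable.
DEGRADED_RESPONSE = {"decision": "not-reviewed", "degraded": True}

_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def decide_with_fallback(decide, tx, timeout_sec=0.2):
    """Run decide(tx); on timeout or exception return a degraded answer."""
    future = _pool.submit(decide, tx)
    try:
        return future.result(timeout=timeout_sec)
    except concurrent.futures.TimeoutError:
        return dict(DEGRADED_RESPONSE, reason="timeout")
    except Exception as exc:
        # Any other failure degrades too, instead of surfacing as a 500.
        return dict(DEGRADED_RESPONSE, reason=type(exc).__name__)
```

The caller always gets a well-formed response, and the `degraded` flag lets it tell a real decision from a fallback.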

Throttling (without back-pressure)
request priority reduced when TX/sec > threshold
Different priority → different queue → different worker
lower priority inside queue for test probes
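
The bullets above can be sketched in Python roughly as follows. Requests are never rejected (no back-pressure); they are routed to a low-priority queue once the arrival rate passes a threshold, and test probes sort behind real traffic inside each queue. The class name, the queue names and the one-second sliding window are assumptions for illustration only.

```python
import heapq
import itertools
from collections import deque


class PriorityThrottle:
    """Demote requests to a low-priority queue once TX/sec exceeds a threshold."""

    def __init__(self, max_tx_per_sec):
        self.max_tx_per_sec = max_tx_per_sec
        self.recent = deque()                  # arrival times within the last second
        self.queues = {"high": [], "low": []}  # one queue per priority
        self._seq = itertools.count()          # tie-breaker that keeps FIFO order

    def submit(self, request, now, is_probe=False):
        # Slide the one-second window, then count the new arrival.
        while self.recent and now - self.recent[0] >= 1.0:
            self.recent.popleft()
        self.recent.append(now)
        name = "low" if len(self.recent) > self.max_tx_per_sec else "high"
        # Test probes (rank 1) sort after real traffic (rank 0) in the same queue.
        rank = 1 if is_probe else 0
        heapq.heappush(self.queues[name], (rank, next(self._seq), request))
        return name

    def take(self, name):
        """Called by the worker dedicated to queue `name`; returns None when empty."""
        queue = self.queues[name]
        return heapq.heappop(queue)[2] if queue else None
```

Each queue would be drained by its own worker, so a burst of demoted requests cannot delay the high-priority path.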

Detect -> Filter -> Alert -> Manual Diagnostics
Alerting

redundancy
CloudWatch/CollectD - fast, no root cause
App events (exceptions) - too noisy, root cause
Pingdom probes - low coverage, reliable
Internal probes - better coverage, false alarms

CloudWatch -> PagerDuty alert (no Riemann)

System test -> PagerDuty alert (Riemann needed)

filter tests using a state machine
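; Define the dispatcher before the stream that uses it.
(defn pagerduty-test-dispatch
  [key]
  (let [pd (pagerduty key)]
    ; Forward an event only when its state changes; assume "passed" at start.
    (changed-state {:init "passed"}
      (where (state "passed") (:resolve pd))
      (where (state "failed") (:trigger pd)))))

(tagged "apisRegression"
  (pagerduty-test-dispatch "1234567892ed295d91"))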
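
For readers unfamiliar with Riemann, the same filtering logic can be sketched in plain Python. This is an illustration only, not how Riemann implements `changed-state`; the function name and the event dictionary keys are assumptions.

```python
def make_changed_state_filter(on_trigger, on_resolve, init="passed"):
    """Forward an event only when its state differs from the previous one."""
    last_state = {}  # (host, service) -> last seen state

    def handle(event):
        key = (event["host"], event["service"])
        previous = last_state.get(key, init)
        last_state[key] = event["state"]
        if event["state"] == previous:
            return  # no transition, so nothing is sent to the pager
        if event["state"] == "failed":
            on_trigger(event)
        elif event["state"] == "passed":
            on_resolve(event)

    return handle
```

A test that fails ten times in a row therefore opens one incident, and the first passing run resolves it.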

re-open manually resolved alert
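; Define the dispatcher before the stream that uses it.
(defn pagerduty-test-dispatch
  [key]
  (let [pd (pagerduty key)]
    (sdo
      ; Resolve only on the transition back to "passed".
      (changed-state {:init "passed"}
        (where (state "passed") (:resolve pd)))
      ; While failing, re-trigger at most once per 60 seconds per host/service,
      ; so a manually resolved alert is re-opened.
      (where (state "failed")
        (by [:host :service]
          (throttle 1 60 (:trigger pd)))))))

(tagged "apisRegression"
  (pagerduty-test-dispatch "1234567892ed295d91"))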
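
The re-open behavior can be sketched in plain Python as well. It only approximates Riemann's `throttle` by allowing one trigger per 60-second window per host/service; the names and the event keys are hypothetical.

```python
def make_retrigger_filter(on_trigger, on_resolve, period_sec=60, init="passed"):
    """Resolve on the transition to passed; re-trigger periodically while failed."""
    last_state = {}    # (host, service) -> last seen state
    window_start = {}  # (host, service) -> time of the last trigger sent

    def handle(event):
        key = (event["host"], event["service"])
        previous = last_state.get(key, init)
        last_state[key] = event["state"]
        if event["state"] == "passed":
            if previous != "passed":
                on_resolve(event)
        elif event["state"] == "failed":
            start = window_start.get(key)
            # Send a trigger at most once per period while the test keeps failing.
            if start is None or event["time"] - start >= period_sec:
                window_start[key] = event["time"]
                on_trigger(event)

    return handle
```

If someone resolves the incident by hand while the test is still failing, the next trigger after the period elapses opens it again.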

Diagnostics - Storm topology timing

#2 risk
Slowing down merchant's website
Alerting
Monitor each and every browser
Aggregations (per browser type)
Notify on thresholds

Monitoring our javascript snippet
Timeouts
Exceptions by browser
Exception aggregation
Monitoring new versions

Riemann's Index (server monitoring)
key (host+service)    event                 TTL
10.0.0.1-redisfree    {"metric": "5"}       60
10.0.0.1-probe1       {"state": "failed"}   300
10.0.0.2-probe1       {"state": "passed"}   300

Riemann's Index
key (host+service)    event                     TTL
199.25.1.1-1234       {"state": "loaded"}       300
199.25.2.1-4567       {"state": "downloaded"}   300
199.25.3.1-8901       {"state": "loaded"}       300
For our use case: host=browser-ip, service=cookie

Riemann's state machine
(index)
    stores last event and creates expired events (TTL)
(by [:host :service] stream)
    creates a new stream for each host/service
(by-host-service stream) - Forter's fork only
    also closes stream when TTL expires
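
A rough Python sketch of what the index does with TTLs: it keeps the latest event per host/service and, once the TTL passes with no update, emits a copy of that event with state "expired". This illustrates the behavior described above and is not Riemann's implementation; the class and field names are assumptions.

```python
class Index:
    """Keep the latest event per (host, service) and expire stale entries."""

    def __init__(self):
        self.events = {}  # (host, service) -> (event, deadline)

    def update(self, event, now):
        key = (event["host"], event["service"])
        self.events[key] = (event, now + event["ttl"])

    def expire(self, now):
        """Remove entries past their TTL and return them as 'expired' events."""
        expired = []
        for key, (event, deadline) in list(self.events.items()):
            if now >= deadline:
                del self.events[key]
                expired.append(dict(event, state="expired", time=now))
        return expired
```

In the browser use case, a cookie that reported "downloaded" and then went silent for 300 seconds would come back as an "expired" event.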

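(defn calc-load-time [& children]
  (by-host-service
    ; :pairs? true makes each state change arrive as a [previous current] pair.
    (changed :state {:pairs? true}
      ; children is a seq, so it must be spliced into smap with apply.
      (apply smap
        (fn [[previous current]]
          (cond
            ; Snippet loaded: load time is the gap between the two events.
            (and (= (:state previous) "downloaded")
                 (= (:state current) "loaded"))
            (assoc previous :metric (- (:time current) (:time previous)))
            ; No "loaded" event before the TTL expired: report the timeout.
            (and (= (:state previous) "downloaded")
                 (= (:state current) "expired"))
            (assoc previous :metric (* JS_TIMEOUT 1000))))
        children))))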
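
The pairing logic in `calc-load-time` can be restated as a small Python function. The `JS_TIMEOUT` value here is a placeholder, since the talk does not give it, and the event keys are assumptions.

```python
JS_TIMEOUT = 10  # hypothetical value; the source does not state it


def load_time(previous, current):
    """Return `previous` with a load-time metric, or None if the pair is irrelevant."""
    if previous["state"] != "downloaded":
        return None
    if current["state"] == "loaded":
        # Normal case: time between download and load.
        return dict(previous, metric=current["time"] - previous["time"])
    if current["state"] == "expired":
        # The snippet never reported "loaded": count it as a timeout.
        return dict(previous, metric=JS_TIMEOUT * 1000)
    return None
```

Returning None for any other pair mirrors the `cond` falling through, in which case nothing is passed downstream.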

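(defn aggregate-by-browser [& children]
  ; One stream per browser type, aggregated over fixed 60-second windows.
  (by [:browser]
    (fixed-time-window 60
      (sdo
        ; children is a seq, so it must be spliced into tag with apply.
        (smap folds/median
          (apply tag "median-load-time" children))
        (smap folds/count
          (apply tag "load-count" children))))))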
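
An offline Python equivalent of this aggregation, useful for checking the numbers against a batch of events. It uses `statistics.median`, which may differ from Riemann's `folds/median` when a window holds an even number of events; the function name and event keys are assumptions.

```python
import statistics
from collections import defaultdict


def aggregate_by_browser(events, window_sec=60):
    """Group load times by (browser, window index) and summarize each group."""
    buckets = defaultdict(list)
    for e in events:
        window = int(e["time"] // window_sec)  # index of the fixed time window
        buckets[(e["browser"], window)].append(e["metric"])
    return {
        key: {"median-load-time": statistics.median(values), "load-count": len(values)}
        for key, values in buckets.items()
    }
```

The median is used rather than the mean so that a few timed-out page loads do not dominate a browser's figure.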

#3 risk
Wrong decision (approve/decline)
Alerting
Anomaly detection

Motivation
Control false alarms mathematically
Threshold per customer
Threshold seasonality

Alert me if
the probability that we decline
more than k out of n transactions,
given decline probability p,
is at most 1 in a million (t = 0.0001%)
n = number of transactions (last 30 minutes)
k = number of declined transactions (last 30 minutes)
p = per-customer decline rate, declined/total (last 24 hours)
t = alert threshold
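
A minimal Python sketch of this test, assuming the alert fires when the upper binomial tail P(X >= k) falls below t. Reading "more than k" as "k or more" is an assumption (it is the usual form of such a test), and the function names are hypothetical. Terms are computed in log space so that large n does not overflow.

```python
import math


def binomial_upper_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    if k <= 0:
        return 1.0
    if k > n or p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        # log of C(n, i) * p^i * (1 - p)^(n - i)
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                    + i * log_p + (n - i) * log_q)
        total += math.exp(log_term)
    return min(total, 1.0)


def should_alert(n, k, p, t=1e-6):
    """True when seeing k or more declines out of n is rarer than threshold t."""
    return binomial_upper_tail(n, k, p) < t
```

For example, with a 5% baseline decline rate, 6 declines out of 100 transactions is unremarkable, while 50 out of 100 is far beyond the one-in-a-million threshold.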

Binomial Distribution Assumption
External events are uncorrelated
What happens when a customer retries the same Tx because the first one was declined?

Questions?
email itai@forter.com
http://tech.forter.com
http://www.softwarearchitectureaddict.com
