SlideShare a Scribd company logo
Monitoring Patterns for
Mitigating Technical Risk
@Forter
#1 risk
Slow or bad (500) API responses
Auto-healing
because humans are slow
SLA, Failover, Degradation, Throttling
Alerting
Detect, Filter, Alert, Diagnostics
SLA
Performance Data Loss Business Logic
TX Processing Low Latency Nope Best Effort
Stream
Processing High Throughput Best Effort Best Effort
Batch Processing High Volume Nope Reconciliation
Automatic Failover
http fencing (Incapsula)
http load balancing (ELB)
instance restart (Scaling Group)
process restart (upstart)
exceptions bubble up and crash
Graceful Degradation
nginx (lua)
expressjs (nodejs)
storm (java)
Stability
Code
Changes
Throttling (without back-pressure)
request priority reduced when TX/sec > thresh
Different priority →
Different queue →
Different worker
lower priority inside queue for test probes
Detect -> Filter -> Alert -> Manual Diagnostics
Alerting
Detection
filter &
route
alert
diagnostics
redundancy
CloudWatch/CollectD - fast, no root cause
App events (exceptions) - too noisy, root cause
Pingdom probes - low coverage, reliable
Internal probes - better coverage, false alarms
cloudwatch
pagerduty
alert
(no riemann)
system test
pagerduty
alert
(riemann needed)
filter tests using a state machine
filter tests using a state machine
(tagged "apisRegression"
(pagerduty-test-dispatch "1234567892ed295d91"))
(defn pagerduty-test-dispatch
[key]
(let [pd (pagerduty key)]
(changed-state {:init "passed"}
(where (state "passed") (:resolve pd))
(where (state "failed") (:trigger pd)))))
re-open manually resolved alert
re-open manually resolved alert
(tagged "apisRegression"
(pagerduty-test-dispatch "1234567892ed295d91"))
(defn pagerduty-test-dispatch
[key]
(let [pd (pagerduty key)]
(sdo
(changed-state {:init "passed"}
(where (state "passed") (:resolve pd)))
(where (state "failed")
(by [:host :service]
(throttle 1 60 (:trigger pd)))))))
Diagnostics - Storm topology timing
Diagnostics - Storm timelines
#2 risk
Slowing down merchant's website
Alerting
Monitor each and every browser
Aggregations (per browser type)
Notify on thresholds
Monitoring our javascript snippet
Timeouts
Exceptions by browser
Exception aggregation
Monitoring new versions
Riemann's Index (server monitoring)
key (host+service) event TTL
10.0.0.1-redisfree { "metric":"5"} 60
10.0.0.1-probe1 {"state":"failed"} 300
10.0.0.2-probe1 { "state":"passed"} 300
Riemann's Index
key (host+service) event TTL
199.25.1.1-1234 {"state":"loaded"} 300
199.25.2.1-4567 {"state":"downloaded"} 300
199.25.3.1-8901 {"state":"loaded"} 300
For our use case:
host=browser-ip, service=cookie
Riemann's state machine
(index)
stores last event and creates expired events (TTL)
(by [:host :service] stream)
creates a new stream for each host/service
(by-host-service stream) - forter's fork only
also closes stream when TTL expires
(defn calc-load-time [& children]
(by-host-service
(changed :state {:pairs? true}
(smap (fn [[previous current]]
(cond
(and (= (:state previous) "downloaded")
(= (:state current) "loaded"))
(assoc previous :metric (- (:time current) (:time previous)))
(and (= (:state previous) "downloaded")
(= (:state current) "expired"))
(assoc previous :metric (* JS_TIMEOUT 1000))))
children))))
(defn aggregate-by-browser [& children]
(by [:browser]
(fixed-time-window 60
(sdo
(smap folds/median
(tag "median-load-time"
children))
(smap folds/count
(tag "load-count"
children)))))))
#3 risk
Wrong decision (approve/decline)
Alerting
Anomaly detection
Motivation
Control false alarms mathematically
Threshold per customer
Threshold seasonality
Alert me if
the probability that we decline
more than k out of n transactions
given probability p
is 1 in a million (t=0.0001%)
n number of tx (30 minutes)
k number of declined txs (30 minutes)
p per customer declined/total (24 hours)
t alert threshold
Binomial Distribution Assumption
External events are uncorrelated
What happens when a customer retries the
same Tx because the first one was declined?
Questions?
email itai@forter.com
http://tech.forter.com
http://www.softwarearchitectureaddict.com

More Related Content

What's hot

Using AWS WAF and Lambda for Automatic Protection
Using AWS WAF and Lambda for Automatic ProtectionUsing AWS WAF and Lambda for Automatic Protection
Using AWS WAF and Lambda for Automatic Protection
Amazon Web Services
 
마이크로서비스를 위한 AWS 아키텍처 패턴 및 모범 사례 - AWS Summit Seoul 2017
마이크로서비스를 위한 AWS 아키텍처 패턴 및 모범 사례 - AWS Summit Seoul 2017마이크로서비스를 위한 AWS 아키텍처 패턴 및 모범 사례 - AWS Summit Seoul 2017
마이크로서비스를 위한 AWS 아키텍처 패턴 및 모범 사례 - AWS Summit Seoul 2017
Amazon Web Services Korea
 
Role-Based Access Control
Role-Based Access ControlRole-Based Access Control
Role-Based Access Control
EmpowerID
 
Agile, User Stories, Domain Driven Design
Agile, User Stories, Domain Driven DesignAgile, User Stories, Domain Driven Design
Agile, User Stories, Domain Driven Design
Araf Karsh Hamid
 
What is an API Gateway?
What is an API Gateway?What is an API Gateway?
What is an API Gateway?
LunchBadger
 
Kafka presentation
Kafka presentationKafka presentation
Kafka presentation
Mohammed Fazuluddin
 
Introduction to AWS Step Functions:
Introduction to AWS Step Functions: Introduction to AWS Step Functions:
Introduction to AWS Step Functions:
Amazon Web Services
 
Performance Testing using Loadrunner
Performance Testingusing LoadrunnerPerformance Testingusing Loadrunner
Performance Testing using Loadrunner
hmfive
 
Application Performance Monitoring (APM)
Application Performance Monitoring (APM)Application Performance Monitoring (APM)
Application Performance Monitoring (APM)
Site24x7
 
Event Streaming in Retail with Apache Kafka
Event Streaming in Retail with Apache KafkaEvent Streaming in Retail with Apache Kafka
Event Streaming in Retail with Apache Kafka
Kai Wähner
 
Envoy and Kafka
Envoy and KafkaEnvoy and Kafka
Envoy and Kafka
Adam Kotwasinski
 
APIsecure 2023 - API orchestration: to build resilient applications, Cherish ...
APIsecure 2023 - API orchestration: to build resilient applications, Cherish ...APIsecure 2023 - API orchestration: to build resilient applications, Cherish ...
APIsecure 2023 - API orchestration: to build resilient applications, Cherish ...
apidays
 
Deep Dive on AWS Lambda
Deep Dive on AWS LambdaDeep Dive on AWS Lambda
Deep Dive on AWS Lambda
Amazon Web Services
 
Static Application Security Testing Strategies for Automation and Continuous ...
Static Application Security Testing Strategies for Automation and Continuous ...Static Application Security Testing Strategies for Automation and Continuous ...
Static Application Security Testing Strategies for Automation and Continuous ...
Kevin Fealey
 
Spring Cloud Gateway
Spring Cloud GatewaySpring Cloud Gateway
Spring Cloud Gateway
VMware Tanzu
 
Best Practices for CI/CD with AWS Lambda and Amazon API Gateway (SRV355-R1) -...
Best Practices for CI/CD with AWS Lambda and Amazon API Gateway (SRV355-R1) -...Best Practices for CI/CD with AWS Lambda and Amazon API Gateway (SRV355-R1) -...
Best Practices for CI/CD with AWS Lambda and Amazon API Gateway (SRV355-R1) -...
Amazon Web Services
 
Introduction to Kong API Gateway
Introduction to Kong API GatewayIntroduction to Kong API Gateway
Introduction to Kong API Gateway
Yohann Ciurlik
 
How to Automate API Testing
How to Automate API TestingHow to Automate API Testing
How to Automate API Testing
Bruno Pedro
 
Cluster-as-code. The Many Ways towards Kubernetes
Cluster-as-code. The Many Ways towards KubernetesCluster-as-code. The Many Ways towards Kubernetes
Cluster-as-code. The Many Ways towards Kubernetes
QAware GmbH
 

What's hot (20)

Using AWS WAF and Lambda for Automatic Protection
Using AWS WAF and Lambda for Automatic ProtectionUsing AWS WAF and Lambda for Automatic Protection
Using AWS WAF and Lambda for Automatic Protection
 
마이크로서비스를 위한 AWS 아키텍처 패턴 및 모범 사례 - AWS Summit Seoul 2017
마이크로서비스를 위한 AWS 아키텍처 패턴 및 모범 사례 - AWS Summit Seoul 2017마이크로서비스를 위한 AWS 아키텍처 패턴 및 모범 사례 - AWS Summit Seoul 2017
마이크로서비스를 위한 AWS 아키텍처 패턴 및 모범 사례 - AWS Summit Seoul 2017
 
Terraform
TerraformTerraform
Terraform
 
Role-Based Access Control
Role-Based Access ControlRole-Based Access Control
Role-Based Access Control
 
Agile, User Stories, Domain Driven Design
Agile, User Stories, Domain Driven DesignAgile, User Stories, Domain Driven Design
Agile, User Stories, Domain Driven Design
 
What is an API Gateway?
What is an API Gateway?What is an API Gateway?
What is an API Gateway?
 
Kafka presentation
Kafka presentationKafka presentation
Kafka presentation
 
Introduction to AWS Step Functions:
Introduction to AWS Step Functions: Introduction to AWS Step Functions:
Introduction to AWS Step Functions:
 
Performance Testing using Loadrunner
Performance Testingusing LoadrunnerPerformance Testingusing Loadrunner
Performance Testing using Loadrunner
 
Application Performance Monitoring (APM)
Application Performance Monitoring (APM)Application Performance Monitoring (APM)
Application Performance Monitoring (APM)
 
Event Streaming in Retail with Apache Kafka
Event Streaming in Retail with Apache KafkaEvent Streaming in Retail with Apache Kafka
Event Streaming in Retail with Apache Kafka
 
Envoy and Kafka
Envoy and KafkaEnvoy and Kafka
Envoy and Kafka
 
APIsecure 2023 - API orchestration: to build resilient applications, Cherish ...
APIsecure 2023 - API orchestration: to build resilient applications, Cherish ...APIsecure 2023 - API orchestration: to build resilient applications, Cherish ...
APIsecure 2023 - API orchestration: to build resilient applications, Cherish ...
 
Deep Dive on AWS Lambda
Deep Dive on AWS LambdaDeep Dive on AWS Lambda
Deep Dive on AWS Lambda
 
Static Application Security Testing Strategies for Automation and Continuous ...
Static Application Security Testing Strategies for Automation and Continuous ...Static Application Security Testing Strategies for Automation and Continuous ...
Static Application Security Testing Strategies for Automation and Continuous ...
 
Spring Cloud Gateway
Spring Cloud GatewaySpring Cloud Gateway
Spring Cloud Gateway
 
Best Practices for CI/CD with AWS Lambda and Amazon API Gateway (SRV355-R1) -...
Best Practices for CI/CD with AWS Lambda and Amazon API Gateway (SRV355-R1) -...Best Practices for CI/CD with AWS Lambda and Amazon API Gateway (SRV355-R1) -...
Best Practices for CI/CD with AWS Lambda and Amazon API Gateway (SRV355-R1) -...
 
Introduction to Kong API Gateway
Introduction to Kong API GatewayIntroduction to Kong API Gateway
Introduction to Kong API Gateway
 
How to Automate API Testing
How to Automate API TestingHow to Automate API Testing
How to Automate API Testing
 
Cluster-as-code. The Many Ways towards Kubernetes
Cluster-as-code. The Many Ways towards KubernetesCluster-as-code. The Many Ways towards Kubernetes
Cluster-as-code. The Many Ways towards Kubernetes
 

Viewers also liked

Car Alarms & Smoke Alarms [Monitorama]
Car Alarms & Smoke Alarms [Monitorama]Car Alarms & Smoke Alarms [Monitorama]
Car Alarms & Smoke Alarms [Monitorama]
Dan Slimmon
 
DevOps & Security from an Enterprise Toolsmith's Perspective
DevOps & Security from an Enterprise Toolsmith's PerspectiveDevOps & Security from an Enterprise Toolsmith's Perspective
DevOps & Security from an Enterprise Toolsmith's Perspective
dev2ops
 
Revolutionizing Enterprise Software Development through Continuous Delivery &...
Revolutionizing Enterprise Software Development through Continuous Delivery &...Revolutionizing Enterprise Software Development through Continuous Delivery &...
Revolutionizing Enterprise Software Development through Continuous Delivery &...People10 Technosoft Private Limited
 
AMIS 25: DevOps Best Practice for Oracle SOA and BPM
AMIS 25: DevOps Best Practice for Oracle SOA and BPMAMIS 25: DevOps Best Practice for Oracle SOA and BPM
AMIS 25: DevOps Best Practice for Oracle SOA and BPM
Matt Wright
 
Achieving DevOps using Open Source Tools in the Enterprise
Achieving DevOps using Open Source Tools in the EnterpriseAchieving DevOps using Open Source Tools in the Enterprise
Achieving DevOps using Open Source Tools in the Enterprise
CollabNet
 
Continuous Deployment at Etsy: A Tale of Two Approaches
Continuous Deployment at Etsy: A Tale of Two ApproachesContinuous Deployment at Etsy: A Tale of Two Approaches
Continuous Deployment at Etsy: A Tale of Two ApproachesRoss Snyder
 
DevOps: A Culture Transformation, More than Technology
DevOps: A Culture Transformation, More than TechnologyDevOps: A Culture Transformation, More than Technology
DevOps: A Culture Transformation, More than Technology
CA Technologies
 
Introducing DevOps
Introducing DevOpsIntroducing DevOps
Introducing DevOps
Nishanth K Hydru
 
Accenture DevOps: Delivering applications at the pace of business
Accenture DevOps: Delivering applications at the pace of businessAccenture DevOps: Delivering applications at the pace of business
Accenture DevOps: Delivering applications at the pace of business
Accenture Technology
 

Viewers also liked (10)

Car Alarms & Smoke Alarms [Monitorama]
Car Alarms & Smoke Alarms [Monitorama]Car Alarms & Smoke Alarms [Monitorama]
Car Alarms & Smoke Alarms [Monitorama]
 
DevOps & Security from an Enterprise Toolsmith's Perspective
DevOps & Security from an Enterprise Toolsmith's PerspectiveDevOps & Security from an Enterprise Toolsmith's Perspective
DevOps & Security from an Enterprise Toolsmith's Perspective
 
Revolutionizing Enterprise Software Development through Continuous Delivery &...
Revolutionizing Enterprise Software Development through Continuous Delivery &...Revolutionizing Enterprise Software Development through Continuous Delivery &...
Revolutionizing Enterprise Software Development through Continuous Delivery &...
 
AMIS 25: DevOps Best Practice for Oracle SOA and BPM
AMIS 25: DevOps Best Practice for Oracle SOA and BPMAMIS 25: DevOps Best Practice for Oracle SOA and BPM
AMIS 25: DevOps Best Practice for Oracle SOA and BPM
 
Achieving DevOps using Open Source Tools in the Enterprise
Achieving DevOps using Open Source Tools in the EnterpriseAchieving DevOps using Open Source Tools in the Enterprise
Achieving DevOps using Open Source Tools in the Enterprise
 
Continuous Deployment at Etsy: A Tale of Two Approaches
Continuous Deployment at Etsy: A Tale of Two ApproachesContinuous Deployment at Etsy: A Tale of Two Approaches
Continuous Deployment at Etsy: A Tale of Two Approaches
 
DevOps
DevOpsDevOps
DevOps
 
DevOps: A Culture Transformation, More than Technology
DevOps: A Culture Transformation, More than TechnologyDevOps: A Culture Transformation, More than Technology
DevOps: A Culture Transformation, More than Technology
 
Introducing DevOps
Introducing DevOpsIntroducing DevOps
Introducing DevOps
 
Accenture DevOps: Delivering applications at the pace of business
Accenture DevOps: Delivering applications at the pace of businessAccenture DevOps: Delivering applications at the pace of business
Accenture DevOps: Delivering applications at the pace of business
 

Similar to Monitoring patterns for mitigating technical risk

"How about no grep and zabbix?". ELK based alerts and metrics.
"How about no grep and zabbix?". ELK based alerts and metrics."How about no grep and zabbix?". ELK based alerts and metrics.
"How about no grep and zabbix?". ELK based alerts and metrics.
Vladimir Pavkin
 
Logging makes perfect - Riemann, Elasticsearch and friends
Logging makes perfect - Riemann, Elasticsearch and friendsLogging makes perfect - Riemann, Elasticsearch and friends
Logging makes perfect - Riemann, Elasticsearch and friends
Itamar
 
Failure Self Defense: Defend your App against failures in a (micro) services ...
Failure Self Defense: Defend your App against failures in a (micro) services ...Failure Self Defense: Defend your App against failures in a (micro) services ...
Failure Self Defense: Defend your App against failures in a (micro) services ...
Tony Fabeen
 
Zabbix Smart problem detection - FISL 2015 workshop
Zabbix Smart problem detection - FISL 2015 workshopZabbix Smart problem detection - FISL 2015 workshop
Zabbix Smart problem detection - FISL 2015 workshop
Zabbix
 
Chicago Flink Meetup: Flink's streaming architecture
Chicago Flink Meetup: Flink's streaming architectureChicago Flink Meetup: Flink's streaming architecture
Chicago Flink Meetup: Flink's streaming architecture
Robert Metzger
 
Flink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San JoseFlink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San JoseKostas Tzoumas
 
Analitica de datos en tiempo real con Apache Flink y Apache BEAM
Analitica de datos en tiempo real con Apache Flink y Apache BEAMAnalitica de datos en tiempo real con Apache Flink y Apache BEAM
Analitica de datos en tiempo real con Apache Flink y Apache BEAM
javier ramirez
 
YOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at NetflixYOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at Netflix
Brendan Gregg
 
Exploring the Final Frontier of Data Center Orchestration: Network Elements -...
Exploring the Final Frontier of Data Center Orchestration: Network Elements -...Exploring the Final Frontier of Data Center Orchestration: Network Elements -...
Exploring the Final Frontier of Data Center Orchestration: Network Elements -...
Puppet
 
Practical non blocking microservices in java 8
Practical non blocking microservices in java 8Practical non blocking microservices in java 8
Practical non blocking microservices in java 8
Michal Balinski
 
Flink 0.10 @ Bay Area Meetup (October 2015)
Flink 0.10 @ Bay Area Meetup (October 2015)Flink 0.10 @ Bay Area Meetup (October 2015)
Flink 0.10 @ Bay Area Meetup (October 2015)
Stephan Ewen
 
Nagios Conference 2007 | Nagios in very large Environments by Werner Neunteufl
Nagios Conference 2007 | Nagios in very large Environments by Werner NeunteuflNagios Conference 2007 | Nagios in very large Environments by Werner Neunteufl
Nagios Conference 2007 | Nagios in very large Environments by Werner Neunteufl
NETWAYS
 
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Amazon Web Services
 
Reactive Microservices with JRuby and Docker
Reactive Microservices with JRuby and DockerReactive Microservices with JRuby and Docker
Reactive Microservices with JRuby and Docker
John Scattergood
 
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Robert Metzger
 
Network Automation with Salt and NAPALM: a self-resilient network
Network Automation with Salt and NAPALM: a self-resilient networkNetwork Automation with Salt and NAPALM: a self-resilient network
Network Automation with Salt and NAPALM: a self-resilient network
Cloudflare
 
Monitoring with Syslog and EventMachine
Monitoring with Syslog and EventMachineMonitoring with Syslog and EventMachine
Monitoring with Syslog and EventMachine
Wooga
 
Cisco Router Security
Cisco Router SecurityCisco Router Security
Cisco Router Security
kktamang
 
The post release technologies of Crysis 3 (Slides Only) - Stewart Needham
The post release technologies of Crysis 3 (Slides Only) - Stewart NeedhamThe post release technologies of Crysis 3 (Slides Only) - Stewart Needham
The post release technologies of Crysis 3 (Slides Only) - Stewart Needham
Stewart Needham
 
FME Server Meets the Challenge of Real-time
FME Server Meets the Challenge of Real-timeFME Server Meets the Challenge of Real-time
FME Server Meets the Challenge of Real-time
Safe Software
 

Similar to Monitoring patterns for mitigating technical risk (20)

"How about no grep and zabbix?". ELK based alerts and metrics.
"How about no grep and zabbix?". ELK based alerts and metrics."How about no grep and zabbix?". ELK based alerts and metrics.
"How about no grep and zabbix?". ELK based alerts and metrics.
 
Logging makes perfect - Riemann, Elasticsearch and friends
Logging makes perfect - Riemann, Elasticsearch and friendsLogging makes perfect - Riemann, Elasticsearch and friends
Logging makes perfect - Riemann, Elasticsearch and friends
 
Failure Self Defense: Defend your App against failures in a (micro) services ...
Failure Self Defense: Defend your App against failures in a (micro) services ...Failure Self Defense: Defend your App against failures in a (micro) services ...
Failure Self Defense: Defend your App against failures in a (micro) services ...
 
Zabbix Smart problem detection - FISL 2015 workshop
Zabbix Smart problem detection - FISL 2015 workshopZabbix Smart problem detection - FISL 2015 workshop
Zabbix Smart problem detection - FISL 2015 workshop
 
Chicago Flink Meetup: Flink's streaming architecture
Chicago Flink Meetup: Flink's streaming architectureChicago Flink Meetup: Flink's streaming architecture
Chicago Flink Meetup: Flink's streaming architecture
 
Flink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San JoseFlink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San Jose
 
Analitica de datos en tiempo real con Apache Flink y Apache BEAM
Analitica de datos en tiempo real con Apache Flink y Apache BEAMAnalitica de datos en tiempo real con Apache Flink y Apache BEAM
Analitica de datos en tiempo real con Apache Flink y Apache BEAM
 
YOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at NetflixYOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at Netflix
 
Exploring the Final Frontier of Data Center Orchestration: Network Elements -...
Exploring the Final Frontier of Data Center Orchestration: Network Elements -...Exploring the Final Frontier of Data Center Orchestration: Network Elements -...
Exploring the Final Frontier of Data Center Orchestration: Network Elements -...
 
Practical non blocking microservices in java 8
Practical non blocking microservices in java 8Practical non blocking microservices in java 8
Practical non blocking microservices in java 8
 
Flink 0.10 @ Bay Area Meetup (October 2015)
Flink 0.10 @ Bay Area Meetup (October 2015)Flink 0.10 @ Bay Area Meetup (October 2015)
Flink 0.10 @ Bay Area Meetup (October 2015)
 
Nagios Conference 2007 | Nagios in very large Environments by Werner Neunteufl
Nagios Conference 2007 | Nagios in very large Environments by Werner NeunteuflNagios Conference 2007 | Nagios in very large Environments by Werner Neunteufl
Nagios Conference 2007 | Nagios in very large Environments by Werner Neunteufl
 
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
 
Reactive Microservices with JRuby and Docker
Reactive Microservices with JRuby and DockerReactive Microservices with JRuby and Docker
Reactive Microservices with JRuby and Docker
 
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
 
Network Automation with Salt and NAPALM: a self-resilient network
Network Automation with Salt and NAPALM: a self-resilient networkNetwork Automation with Salt and NAPALM: a self-resilient network
Network Automation with Salt and NAPALM: a self-resilient network
 
Monitoring with Syslog and EventMachine
Monitoring with Syslog and EventMachineMonitoring with Syslog and EventMachine
Monitoring with Syslog and EventMachine
 
Cisco Router Security
Cisco Router SecurityCisco Router Security
Cisco Router Security
 
The post release technologies of Crysis 3 (Slides Only) - Stewart Needham
The post release technologies of Crysis 3 (Slides Only) - Stewart NeedhamThe post release technologies of Crysis 3 (Slides Only) - Stewart Needham
The post release technologies of Crysis 3 (Slides Only) - Stewart Needham
 
FME Server Meets the Challenge of Real-time
FME Server Meets the Challenge of Real-timeFME Server Meets the Challenge of Real-time
FME Server Meets the Challenge of Real-time
 

Recently uploaded

May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
Adele Miller
 
A Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdfA Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdf
kalichargn70th171
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
Fermin Galan
 
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
Tier1 app
 
Accelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with PlatformlessAccelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with Platformless
WSO2
 
top nidhi software solution freedownload
top nidhi software solution freedownloadtop nidhi software solution freedownload
top nidhi software solution freedownload
vrstrong314
 
First Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User EndpointsFirst Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User Endpoints
Globus
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Globus
 
Into the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdfInto the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdf
Ortus Solutions, Corp
 
RISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent EnterpriseRISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent Enterprise
Srikant77
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
Ortus Solutions, Corp
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Globus
 
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.ILBeyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Natan Silnitsky
 
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Shahin Sheidaei
 
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdfEnhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Jay Das
 
Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024
Globus
 
2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
Georgi Kodinov
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
Paco van Beckhoven
 

Recently uploaded (20)

May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
 
A Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdfA Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdf
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
 
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
 
Accelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with PlatformlessAccelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with Platformless
 
top nidhi software solution freedownload
top nidhi software solution freedownloadtop nidhi software solution freedownload
top nidhi software solution freedownload
 
First Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User EndpointsFirst Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User Endpoints
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
 
Into the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdfInto the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdf
 
RISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent EnterpriseRISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent Enterprise
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
 
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.ILBeyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
 
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
 
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdfEnhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
 
Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024
 
2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
 

Monitoring patterns for mitigating technical risk

Editor's Notes

  1. riemann is an event streaming processing server written in clojure tailored for monitoring. It receives input from the infrastructure (OS/queues/DBs) and instrumented applications (all prog languages) processes the events (enrichement, filtering, aggregation) and forwards them for visualization and alerting. explain the pros/cons of infra monitoring (plugins), application monitoring (custom events), and system probes. the need for processing before a machine wakes you up at night (the role of pagerduty in all of this) and the need for visualization when you wake up at night.
  2. riemann is an event streaming processing server written in clojure tailored for monitoring. It receives input from the infrastructure (OS/queues/DBs) and instrumented applications (all prog languages) processes the events (enrichement, filtering, aggregation) and forwards them for visualization and alerting. explain the pros/cons of infra monitoring (plugins), application monitoring (custom events), and system probes. the need for processing before a machine wakes you up at night (the role of pagerduty in all of this) and the need for visualization when you wake up at night.
  3. riemann is an event streaming processing server written in clojure tailored for monitoring. It receives input from the infrastructure (OS/queues/DBs) and instrumented applications (all prog languages) processes the events (enrichement, filtering, aggregation) and forwards them for visualization and alerting. explain the pros/cons of infra monitoring (plugins), application monitoring (custom events), and system probes. the need for processing before a machine wakes you up at night (the role of pagerduty in all of this) and the need for visualization when you wake up at night.
  4. riemann is an event streaming processing server written in clojure tailored for monitoring. It receives input from the infrastructure (OS/queues/DBs) and instrumented applications (all prog languages) processes the events (enrichement, filtering, aggregation) and forwards them for visualization and alerting. explain the pros/cons of infra monitoring (plugins), application monitoring (custom events), and system probes. the need for processing before a machine wakes you up at night (the role of pagerduty in all of this) and the need for visualization when you wake up at night.
  5. riemann is an event streaming processing server written in clojure tailored for monitoring. It receives input from the infrastructure (OS/queues/DBs) and instrumented applications (all prog languages) processes the events (enrichement, filtering, aggregation) and forwards them for visualization and alerting. explain the pros/cons of infra monitoring (plugins), application monitoring (custom events), and system probes. the need for processing before a machine wakes you up at night (the role of pagerduty in all of this) and the need for visualization when you wake up at night.
  6. On the left: example for cloudwatch alert for the load balancer detecting REST API error 500 (internal server error) On the right: example for escalation policies. alert for a system test (probe) that failed.
  7. On the left: example for cloudwatch alert for the load balancer detecting REST API error 500 (internal server error) On the right: example for escalation policies. alert for a system test (probe) that failed.
  8. PagerDuty has an alerts panel. An alert is either triggered or resolved. PD assigns the alert to whoever is on duty and then notifies him based on predefined rules (for example call if the alert has not been resolved for 15 minutes). However, PagerDuty API has a throttler. We cannot send every test result to PD every time the test runs. In essence we want to replicate the riemann state into PD. Only when the riemann state changes (a test that previously passed now fails or vica versa) we update PD. Riemann makes it easy to define a state machine by storing the last's event per [host+service] combination.
  9. PagerDuty has an alerts panel. An alert is either triggered or resolved. PD assigns the alert to whoever is on duty and then notifies him based on predefined rules (for example call if the alert has not been resolved for 15 minutes). However, PagerDuty API has a throttler. We cannot send every test result to PD every time the test runs. In essence we want to replicate the riemann state into PD. Only when the riemann state changes (a test that previously passed now fails or vica versa) we update PD. Riemann makes it easy to define a state machine by storing the last's event per [host+service] combination.
  10. But what happens if we manually resolved the alert on PD and the test keeps failing? You would want the alert to be repoened again. This means that we need to filter passed test events using state machine, but not filter failed tests at all. We just throttle them to no more than 1 per minute per test permutation [host service] combination.
  11. But what happens if we manually resolved the alert on PD and the test keeps failing? You would want the alert to be repoened again. This means that we need to filter passed test events using state machine, but not filter failed tests at all. We just throttle them to no more than 1 per minute per test permutation [host service] combination.
  12. filtering widgets that were not been possible without riemann enrichmenent. The quick filters allow fast zoom-in on the branch you are working on, and werether you would like to see test probes or not.