Alerts Overload
How to adopt a microservices architecture without being overwhelmed with noise
Sarah Wells
@sarahjwells
Microservices make it worse
microservices (n, pl): an efficient device for transforming business problems into distributed transaction problems
@drsnooks
You have a lot more systems
45 microservices
3 environments
2 instances for each service
20 checks per service
running every 5 minutes
> 1,500,000 system checks
per day
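The arithmetic behind that total can be checked with a short sketch. The class and method names are illustrative only. The slide says "20 checks per service", but the total is only reached if the 20 checks run on each instance, so the sketch assumes that reading. It also treats the five-minute interval as exact.

```java
// Illustrative arithmetic only: names are hypothetical, figures come from the slides.
class SystemCheckVolume {
    // Assumes each instance runs the full set of checks once per interval.
    static long checksPerDay(int services, int environments, int instancesPerService,
                             int checksPerInstance, int intervalMinutes) {
        long checksPerRun = (long) services * environments * instancesPerService * checksPerInstance;
        long runsPerDay = (24 * 60) / intervalMinutes;
        return checksPerRun * runsPerDay;
    }

    public static void main(String[] args) {
        // 45 services x 3 environments x 2 instances x 20 checks, every 5 minutes
        System.out.println(checksPerDay(45, 3, 2, 20, 5)); // prints 1555200
    }
}
```

That is 5,400 checks per run and 288 runs per day, so even a tiny false-positive rate produces a steady stream of alerts.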
Over 19,000 system
monitoring alerts in 50 days
An average of 380 per day
Functional monitoring is also an issue
12,745 response time/error
alerts in 50 days
An average of 255 per day
Why so many?
http://devopsreactions.tumblr.com/post/122408751191/alerts-when-an-outage-starts
How can you make it better?
Quick starts: attack your problem
See our EngineRoom blog for more:
http://bit.ly/1PP7uQQ
1 2 3
Think about monitoring from the start
1
It's the business functionality you care about
We care about whether published content made it to us
When people call our APIs, we care about speed
… we also care about errors
But it's the end-to-end that matters
https://www.flickr.com/photos/robef/16537786315/
You only want an alert where you need to take action
If you just want information, create a dashboard or report
Make sure you can't miss an alert
Make the alert great
http://www.thestickerfactory.co.uk/
Build your system with support in mind
Transaction ids tie all microservices together
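One way to picture this is a minimal sketch, under stated assumptions, of how a service might reuse the caller's transaction id or mint a new one, then stamp it on every log line. The header name, the class and the method names are hypothetical. Only the `tid_` prefix and the `transaction_id` field name are taken from the sample alert later in the deck.

```java
import java.util.Optional;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch: reuse the caller's transaction id, or mint one at the edge.
class TransactionIds {
    // Assumed header name; the slides do not say which header carries the id.
    static final String HEADER = "X-Request-Id";
    private static final int SUFFIX_LENGTH = 10;

    // Returns the incoming id unchanged when present, otherwise a new "tid_" id.
    static String resolve(Optional<String> incomingHeaderValue) {
        return incomingHeaderValue
                .map(String::trim)
                .filter(value -> !value.isEmpty())
                .orElseGet(TransactionIds::generate);
    }

    // Builds "tid_" followed by ten random lowercase letters, mimicking the sample id.
    static String generate() {
        StringBuilder id = new StringBuilder("tid_");
        for (int i = 0; i < SUFFIX_LENGTH; i++) {
            id.append((char) ('a' + ThreadLocalRandom.current().nextInt(26)));
        }
        return id.toString();
    }

    // Prefix every log line with the id so a log aggregator can join them up.
    static String logLine(String transactionId, String message) {
        return "transaction_id=" + transactionId + " " + message;
    }

    public static void main(String[] args) {
        String tid = resolve(Optional.empty());
        System.out.println(logLine(tid, "publish received"));
    }
}
```

Each service passes the same id on with its outgoing requests, so a single search for it in the log aggregator shows the whole journey.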
Healthchecks tell you whether a service is OK
GET http://{service}/__health
returns 200 if the service can run the healthcheck
each check will return "ok": true or "ok": false
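A minimal sketch of what such a response body could look like follows. Only the `"ok"` field and the 200 status come from the slide. The `checks` and `name` fields, the class name and the example check names are assumptions made for illustration, and the sketch does no JSON escaping.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;
import java.util.stream.Collectors;

// Hypothetical sketch of a /__health payload; field names other than "ok" are assumptions.
class HealthcheckSketch {
    // The HTTP status reports that the healthcheck itself ran, not that every check passed.
    static final int STATUS_WHEN_CHECKS_RAN = 200;

    // Runs each check and renders one {"name": ..., "ok": ...} entry per check.
    static String render(Map<String, BooleanSupplier> checks) {
        String body = checks.entrySet().stream()
                .map(e -> "{\"name\": \"" + e.getKey() + "\", \"ok\": "
                        + e.getValue().getAsBoolean() + "}")
                .collect(Collectors.joining(", "));
        return "{\"checks\": [" + body + "]}";
    }

    public static void main(String[] args) {
        Map<String, BooleanSupplier> checks = new LinkedHashMap<>();
        checks.put("database reachable", () -> true);
        checks.put("message queue reachable", () -> false);
        System.out.println(STATUS_WHEN_CHECKS_RAN + " " + render(checks));
    }
}
```

Keeping the status at 200 while a check reports `"ok": false` means the monitoring tool has to read the body, not just the status code.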
Synthetic requests tell you about problems early
https://www.flickr.com/photos/jted/5448635109
Use the right tools for the job
2
There are basic tools you need
FT Platform: An internal PaaS
Service monitoring (e.g. Nagios)
Log aggregation (e.g. Splunk)
Graphing (e.g. Graphite/Grafana)
metrics:
  reporters:
    - type: graphite
      frequency: 1 minute
      durationUnit: milliseconds
      rateUnit: seconds
      host: <%= @graphite.host %>
      port: 2003
      prefix: content.<%= @config_env %>.api-policy-component.<%= scope.lookupvar('::hostname') %>
Real time error analysis (e.g. Sentry)
Build other tools to support you
SAWS
Built by Silvano Dossan
See our Engine room blog: http://bit.ly/1GATHLy
"I imagine most people do exactly
what I do - create a google filter to
send all Nagios emails straight to the
bin"
"Our screens have a viewing angle of
about 10 degrees"
"It never seems to show the page I
want"
Code at: https://github.com/muce/SAWS
Dashing
Nagios chart
Built by Simon Gibbs
@simonjgibbs
Use the right communication channel
It's not email
Slack integration
Radiators everywhere
Cultivate your alerts
3
Review the alerts you get
If it isn't helpful, make sure you don't get sent it again
See if you can improve it
www.workcompass.com/
Splunk Alert: PROD - MethodeAPIResponseTime5MAlert
Business Impact
The methode api server is slow responding to requests.
This might result in articles not getting published to the new
content platform or publishing requests timing out.
...
…
Technical Impact
The server is experiencing service degradation because of
network latency, high publishing load, high bandwidth
utilization, excessive memory or cpu usage on the VM. This
might result in failure to publish articles to the new content
platform.
Splunk Alert: PROD Content Platform Ingester Methode
Publish Failures Alert
There has been one or more publish failures to the
Universal Publishing Platform. The UUIDs are listed below.
Please see the run book for more information.
_time transaction_id uuid
Mon Oct 12 07:43:54 2015 tid_pbueyqnsqe a56a2698-6e90-11e5-8608-a0853fb4e1fe
When you didn't get an alert
What would have told you about this?
Setting up an alert is part of fixing the problem
✔ code
✔ test
alerts
System boundaries are more difficult
Severin.stalder [CC BY-SA 3.0
(http://creativecommons.org/licenses/by-sa/3.0)], via
Make sure you would know if an alert stopped working
Add a unit test
public void shouldIncludeTriggerWordsForPublishFailureAlertInSplunk() {
…
}
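The body of that test is elided on the slide, so what follows is only a sketch of the idea, not the FT's test. The log message format, the trigger words and all class names are invented for illustration, and a plain method stands in for a test framework so the example runs on its own.

```java
// Hypothetical production code: the message format is an assumption.
class PublishFailureLog {
    static String message(String transactionId, String uuid) {
        return "Publish failed for uuid=" + uuid + " transaction_id=" + transactionId;
    }
}

// Hypothetical check that the log line still contains what the alert search looks for.
class PublishFailureAlertCheck {
    // Deliberately duplicated here rather than shared with the production code:
    // these must match the words used in the alert's search.
    static final String[] ALERT_TRIGGER_WORDS = {"Publish failed", "uuid=", "transaction_id="};

    static boolean includesTriggerWords(String logMessage) {
        for (String word : ALERT_TRIGGER_WORDS) {
            if (!logMessage.contains(word)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String logged = PublishFailureLog.message("tid_example", "example-uuid");
        if (!includesTriggerWords(logged)) {
            throw new AssertionError("log message no longer matches the alert search");
        }
        System.out.println("alert trigger words present");
    }
}
```

Because the expected words live in the test, a well-meaning rewording of the log message fails the build instead of silently disabling the alert.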
Deliberately break things
Chaos snail
The thing that sends you alerts needs to be up and running
https://www.flickr.com/photos/davidmasters/2564786205/
What's happened to our alerts?
We turned off ALL emails from
system monitoring
Our two most important alerts
come in via our team slack
channel
We have dashboards for
our read APIs in Grafana
To summarise...
Build microservices
1 2 3
About technology at the FT:
Look us up on Stack Overflow
http://bit.ly/1H3eXVe
Read our blog
http://engineroom.ft.com/
The FT on github
https://github.com/Financial-Times/
https://github.com/ftlabs
Questions?

Velocity 2015 Amsterdam: Alerts overload

Editor's Notes

  1. Two years ago, I started working on a new project at the FT, rebuilding our content platform and APIs. We're using a microservice architecture. I'm here to talk about what it's like to move from monitoring a monolithic application to monitoring a whole lot of microservices. Which is also about what it's like to start doing devops, because when you are building new microservices whenever you need, and throwing them away when they stop being useful, you can't do a handover to a separate operations team each time: it takes too long. So you are going to be supporting your services, and the pain that operations used to feel when you didn't get monitoring and alerting right is now being felt by you…
  2. I'm guessing a lot of people in this room have been on a support mailing list at some point, so this probably looks familiar. Too many emails, and very hard to work out what they really mean. The bad news is ...
  3. I saw this recently and it made me laugh. BUT - there are lots of things I really like about microservices! It's easy to reason about the logic within a microservice, it's easier to deploy small changes both quickly and reversibly, it's easy to change your architecture, and once you have, it's easy to remove the code you don't need any more, because it's all in one service and you can check that nothing is calling it via the access logs for the service… So I don't want to go back to writing monolithic applications - but I do think that monitoring is harder for a microservice architecture. So why is that?
  4. Firstly, instead of 1 service, we have 45
  5. We currently have Integration, Test and Production environments. There's some debate about whether we need three, and other teams at the FT only have production.
  6. We have at least 2 instances, for resilience, and sometimes more. And at the moment, each of those is on its own VM.
  7. These are system checks - disk space, CPU load, NTP, DNS
  8. Most of the checks run more often than every 5 minutes in fact
  9. Which means you get alerts for unlikely and transient issues all the time. Earlier this year, a new developer joined our team, and he couldn't believe the number of alert emails we were getting. He started counting.
  10. And that's on average. When shared infrastructure goes wrong, for example if system time isn't being properly synchronised or someone accidentally switched off a DNS server, and you're monitoring it from every server, EVERYTHING lights up. As an example, we use puppet to automate server setup and deployment - and we had 20,000 alert emails overnight for a PLANNED failover of our puppet master. But it's not just system monitoring that is painful...
  11. We started out creating alerts and monitoring a lot like we did for monolithic applications: alerts based on response time, and alerts for ERROR logs or for responses with server error status codes - 500s for example.
  12. First off, where in a monolith you were calling a function, now you're making an http request, which means there are more things that can go wrong.
  13. If one thing fails...
  14. You'll get an alert from the service using it...
  15. But if you're naive in the way you set up alerts, you'll also get an alert from anything calling THAT service. Getting alerts from multiple services can also make it difficult to find the cause. And when things DO go wrong...
  16. This is what it feels like …
  17. You need to be able to support your system, which means you need to sort out your monitoring and alerting. ... It was clear this was causing us problems, especially when we looked at the numbers: with the system and functional monitoring alerts added together, that's one every 5 minutes. So with the support of our Product Owner, we took some time to work on this.
  18. We have a thing we do at the FT called a Quickstart - we take a small team, maybe from several different projects or skillsets, and we put them in a room together. No specific requirements, no backlog - just a topic of interest. From feedback, it's apparently very important that free coffee and biscuits get delivered twice a day… In this case, we focussed on alerts: how to make them more useful and rescue our email inboxes. (There are more details on this on our Technology blog, the Engine Room.)
  19. As a result of this, I can tell you about three principles that helped us to reduce the number of alerts and spend less time responding to false alarms and confusing information.
  20. We got some things right, and I'll cover those later. What we got wrong is that we created far too many alerts without thinking about why we were doing it… it was just another thing on the checklist - create an alert. The problem is, you probably don't care about these alerts. I mean, how much do you care about NTP issues in non-production environments? But more importantly, you don't care about response times or errors where a service is just passing on what it got from lower down the stack.
  21. It's the business functionality you care about, not the individual microservice.
  22. For example, we are responsible for publishing FastFT posts - if that widget on the right on our site home page stops getting the latest updates, we will hear about it. So that's what our alerts should be focussing on. So to tell you what's important to us, I need to tell you a bit about our system...
  23. This is a logical view of the Universal Publishing Platform. Multiple source content management systems send us articles, blogs, images, videos etc. When content is published, it's transformed into a common format and annotated using a concept extraction pipeline. We also have metadata taxonomies like organisations, people and memberships, all loaded in. Then there are APIs to get content and metadata about content: articles about Apple -> information about Apple -> information about Tim Cook -> other companies he's involved with, etc. Architecturally, we have a mix of Go and Java/Dropwizard apps. We use Kafka to send messages about events. We have GraphDB and Mongo data stores. So what is our key business functionality?
  24. 1. Publishing and transforming content
  25. 2. Annotating that content - i.e. working out which companies an article mentions, or what person it's about
  26. 3. Loading updates of our data about organisations, people, etc
  27. 4. Making all that information available via APIs. But it's not the same things we care about for each...
  28. We want to know about every failure, because each failure is a story that our customers can't read yet. Our alert should make it clear we've failed to publish something, AND what needs to be done to fix it.
  29. For publication, there aren't that many events a day - maybe 600. We can look at individual events. For our APIs, we have 2.8 million requests a day at the moment, a little over 30 a second. So we look at 95th and 99th percentile response time, for example, to make sure they're ok. It doesn't have to be super fast, but it definitely can't be super slow. But we don't JUST care about speed...
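To illustrate why percentiles are a better fit than averages at this request volume, here is a minimal sketch using the nearest-rank method. The sample timings are made up for illustration, and real monitoring tools may interpolate percentiles differently.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Made-up response times in milliseconds: nine fast requests, one very slow.
response_times_ms = [12, 15, 11, 14, 13, 18, 16, 17, 19, 950]

median = percentile(response_times_ms, 50)
p95 = percentile(response_times_ms, 95)
p99 = percentile(response_times_ms, 99)
mean = sum(response_times_ms) / len(response_times_ms)

# Request rate quoted in the notes: 2.8 million requests a day.
requests_per_second = 2_800_000 / (24 * 60 * 60)
```

The median hides the slow request entirely and the mean blurs it, while the high percentiles surface it directly.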
  30. i.e. did something go wrong. The obvious thing to look for is server errors - something has gone wrong somewhere in our stack. The graph here shows when some of our blades failed in a data centre. This is for some business functionality that's not critical at the moment, so we are comfortable with all the nodes being in the same data centre, in case you're wondering why a blade failure would break things! The sudden increase in 500 errors triggered our alerts so we knew about this really quickly. However, we also look for client errors: a sudden increase in 400 errors, i.e. bad requests, could be your fault. We've made changes that turned out to break our API contract - e.g. POST requests suddenly needed to have a Content-Type header of application/json. That meets the http spec, but is less lenient, and so BAD. We would want an alert for that.
  31. We have built in back off and retry for recoverable errors. Sometimes the first request fails, and the second one succeeds. We don't want an alert in that case. We might want a report, so we know we have a flaky connection. Or we might just accept that our network is evil.
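A minimal sketch of that back off and retry idea, assuming exponential delays. The names `RecoverableError` and `call_with_retry` are hypothetical and not taken from the FT's code. The failure count is returned so it can feed a report, while an exception (and so an alert) only escapes once every attempt has failed.

```python
import time

class RecoverableError(Exception):
    """Hypothetical marker for errors worth retrying, e.g. a dropped connection."""

def call_with_retry(operation, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run operation, retrying recoverable errors with exponential back off.

    Returns (result, failures_before_success). Re-raises the last error
    once all attempts are used up; only that case should trigger an alert.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    failures = 0
    for attempt in range(attempts):
        try:
            return operation(), failures
        except RecoverableError:
            failures += 1
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
```

A non-zero failure count on a successful call is the kind of thing that belongs in a report or dashboard rather than an alert.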
  32. Otherwise, it's just noise. Your alerts should be something you don't mind being interrupted about.
  33. You can go look at it whenever you want. I bet you won't look at it as often as you think you will. We got rid of all our publish microservice-specific response time alerts and all our microservice-specific error alerts, and made the most interesting ones into real-time dashboards.
  34. Now that your alerts really mean you need to react, make them unmissable. This means they need to attract the attention of the people that need to react. How you do that depends on your team and your working practices. We have an 'Ops Cop', and take it in turns to do that role for a week. The ops cop will also take on small pieces of work, tidying up, refactoring - things that don't need you to be in flow (because you WILL get interrupted).
  35. Anyone reading the alert should be able to work out: what it actually means; the action they need to take; who to talk to if they get stuck. Use clear language and don't be vague. Add a link to explanatory information (panic guide) - this needs to be clear too, and needs to be reviewed by someone who may have to use it but didn't write the service (e.g. new team members who've never had to look at this service/operations). Consider how to make "future you's" life easier: here's a search link to show you the whole transaction; here's a jenkins job to republish.
  36. Our transaction IDs are added to logs using MDC (Mapped Diagnostic Context). Every microservice we write needs to check for a special X-Request-Id header (we do this via a Servlet Filter), then add it to the thread context. Any requests over http must pass on the X-Request-Id header too.
  37. This means all logs for a particular user request will have a unique identifier logged, and we can look at everything that happened when an article was published or a read request was made.
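The notes describe a Java setup (MDC plus a Servlet Filter). As a rough analogue only, here is the same pattern sketched in Python, with a context variable standing in for the MDC and a logging filter stamping each record. Minting a new ID when the header is missing is an assumption of this sketch, as are all function names.

```python
import contextvars
import logging
import uuid

REQUEST_ID_HEADER = "X-Request-Id"

# Plays the role of the MDC: per-request context visible to all logging calls.
_transaction_id = contextvars.ContextVar("transaction_id", default="-")

class TransactionIdFilter(logging.Filter):
    """Attach the current transaction ID to every log record."""
    def filter(self, record):
        record.transaction_id = _transaction_id.get()
        return True

def begin_request(headers):
    """Call at the start of each request (the Servlet Filter's job in Java).

    Reuses the caller's ID if present; otherwise mints one (an assumption).
    Expects a plain dict with the exact header name as key.
    """
    tid = headers.get(REQUEST_ID_HEADER) or "tid_" + uuid.uuid4().hex
    _transaction_id.set(tid)
    return tid

def outgoing_headers():
    """Headers to add to any downstream http call so the ID is passed on."""
    return {REQUEST_ID_HEADER: _transaction_id.get()}

def configure_logging(logger):
    """Make the ID appear in every line written through this logger."""
    handler = logging.StreamHandler()
    handler.addFilter(TransactionIdFilter())
    handler.setFormatter(
        logging.Formatter("%(levelname)s transaction_id=%(transaction_id)s %(message)s")
    )
    logger.addHandler(handler)
```

With every service doing this, a single search on the transaction ID in the log aggregator returns the whole journey of one publish or read request.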
  38. We have an FT standard for healthchecks - you must return a particular json response on a particular endpoint.
  39. You return 200 for unhealthy as well - there was some debate about this; the logic is that a 500 indicates that the healthcheck can't be run, which is different from it failing.
  40. You have to look at each check to work out whether you have any failing checks
  41. This is what the json looks like
  42. There's a chrome plugin to make it look nicer for humans
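A minimal sketch of the "200 even when unhealthy" convention described above. The JSON field names (`checks`, `name`, `ok`) are hypothetical and are not the FT healthcheck standard, which is only shown on the slide. Treating a check that raises as a failed check is also a choice made for this sketch.

```python
import json

def run_healthcheck(checks):
    """Run each named check and report the results.

    checks maps a check name to a zero-argument callable returning True
    when healthy. The status is 200 whenever the healthcheck itself could
    run, whether or not individual checks passed; a 500 would come from
    the web framework if this function could not run at all.
    """
    results = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False  # a check that blows up counts as failing
        results.append({"name": name, "ok": ok})
    return 200, json.dumps({"checks": results})

def failing_checks(body):
    """What a monitor has to do: inspect every check, not the status code."""
    return [c["name"] for c in json.loads(body)["checks"] if not c["ok"]]
```

For example, `run_healthcheck({"mongo": lambda: True, "kafka": lambda: False})` returns status 200, and `failing_checks` on its body returns `["kafka"]`.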
  43. You want to know about problems before they affect your customers, if possible. We started off with synthetic publication requests. Synthetic publication takes a known, old article, and publishes it every minute. If this breaks, we can fix it before a single real publish fails.
  44. By basic, I mean standard
  45. A puppet based framework. Goal: for developers to reliably build & deploy services* from "zero-to-customer" in less than 15mins ... across data centres, with monitoring ... supports multiple IaaS providers. Digression: there's some debate about FT Platform internally, and some teams aren't using it: they use heroku or 'naked' AWS. Personal opinion: it bootstrapped this type of deployment at the FT, and at the time most developers weren't that familiar with the underlying tools, but if you are already familiar with heroku and AWS, it can feel like you're being restricted. We're now evolving FT Platform to reflect that, with a move to CloudFormation and an internal tool called Konstructor that provides an API wrapper round a lot of our other tools. However: it gave us monitoring and log aggregation for any new microservice with no additional effort.
  46. Nagios monitors system metrics, network protocols, applications, services, servers, and network infrastructure. It alerts via email or (god forbid) SMS when there are failures and when the service recovers. You can acknowledge alerts to stop the notifications, or put things into maintenance mode for known downtimes.
  47. Every VM set up using FT Platform automatically forwards logs to Splunk. Any queries you want to do across all hosts in a service, or all services that take part in a particular event, are easy to do without having to jump onto the relevant box. We use it to: identify problems and alert; visualise performance or load; create dashboards for particular services. But more recently, we're moving away from Splunk dashboards...
  48. And instead we're graphing our metrics using Graphite and Grafana. We're using Dropwizard for our Java apps and that comes with codahale metrics embedded. It's a small config change to write those metrics to a graphite server...
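For a sense of what ends up on the wire, here is a minimal sketch of Graphite's plaintext protocol: one line per data point, `path value timestamp`, sent to the Carbon listener (port 2003 by default). This is not the Dropwizard reporter or its configuration; the metric path and host in the usage note are made up.

```python
import socket
import time

def graphite_line(path, value, timestamp=None):
    """Format one data point for Graphite's plaintext protocol."""
    ts = int(time.time() if timestamp is None else timestamp)
    return f"{path} {value} {ts}\n"

def send_to_graphite(lines, host, port=2003):
    """Send already formatted lines to a Carbon plaintext listener."""
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall("".join(lines).encode("ascii"))
```

For example, `send_to_graphite([graphite_line("read-api.responses.5xx", 3)], "graphite.example.internal")` would record three server errors at the current time, where both names are placeholders.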
  49. Graphite isn't particularly pretty - you can see all the metrics and compose graphs on the fly... ... but by using Grafana on top of it you can easily create beautiful custom dashboards...
  50. They're quick to load as well. This shows one of our Read API components, so we're interested in server errors, client errors and successful requests, and also request rate across hosts. Interesting - here, the traffic started to switch over from one data centre to another, I have no idea why!
  51. We were using Splunk to pick up ERROR level logs. The problem there is that every ERROR results in an alert. You might be more hardcore about this than me, but unless you have zero tolerance of ERROR logs, there will be times when there are some errors that aren't a priority - they don't represent a major issue and there aren't that many. We got some of those from the client we use to talk to Kafka. We were ignoring them and missed a problem someone introduced that also caused ERROR logs. That wouldn't happen in Sentry or equivalent tools, because each new error TYPE results in an alert. Again, sending information from a Dropwizard app to Sentry is a simple configuration change to send logs out to the Sentry API. OK, so that's the basic tools...
  52. If the basic tools aren't giving you what you need, build your own. This is easier if those basic tools have good APIs - because you can create your own view easily. Our first 'extra' tool was created by one of our integration engineers - he turned up with it one day…
  53. SAWS. Built using Blinky tape - a programmable LED strip. Each section represents a different part of our system. Things light up when there's a problem, and when there isn't a problem, the blue lights swoosh back and forth so you know the monitoring is still running. It used to be really cool and run on a Raspberry Pi - it's a Python script - but that broke and now it runs on an old Windows box under someone's desk. So why did Silvano create this? First off, frustrations with the number of emails...
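The difference between "alert on every ERROR" and "alert on each new error TYPE" can be sketched as below. This is a simplified illustration, not how Sentry groups events (it uses richer fingerprints); the class name and the error type strings are hypothetical.

```python
class NewErrorTypeAlerter:
    """Alert the first time an error type is seen; just count repeats."""

    def __init__(self):
        self.counts = {}

    def record(self, error_type):
        """Return True if this occurrence should raise an alert."""
        is_new = error_type not in self.counts
        self.counts[error_type] = self.counts.get(error_type, 0) + 1
        return is_new

alerter = NewErrorTypeAlerter()
alerts = [
    alerter.record("KafkaClientTimeout"),   # first sighting: alert
    alerter.record("KafkaClientTimeout"),   # known noise: no alert
    alerter.record("KafkaClientTimeout"),
    alerter.record("NullPointerException"), # newly introduced problem: alert
]
```

The known, low-priority Kafka errors stop drowning out the new problem, which is the failure mode described in the note above.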
  54. Which he was sending straight to the bin...
  55. And secondly, frustration with monitoring screens
  56. He wanted something that was easy to instantly see if there was a problem
  57. This is SAWS up in our office. It's pretty simple - red indicates something bad has happened. He also changed from green to blue after this, to make sure everyone can see if there's a problem… It's not really this bright :) So that was our first tool. Our second tool addresses the problem of waiting for screens to cycle through to see the one you want to see - by providing a single screen that can tell you what you need to know...
  58. Dashing is a Sinatra based framework that lets you build beautiful dashboards. Originally built by Shopify for showing things on monitors around the office. Adopted by the FT - lots of things we care about are very easy to add as tiles: nagios (monitoring), jenkins (build and deployment), pingdom (website monitoring). And it's not hard to add a new widget to integrate another system. This is the customised dashboard for our system. We have tiles for our nagios monitors, and for particular jenkins jobs - the ones with the dial.
  59. And this is the FT's dashboard of everything. We have a duty ops team who are first line support. They use Dashing very heavily and they'll ask for things not surfaced on Dashing to be added. These tiles are arranged by service level, so the most critical systems are top right, with a platinum border. Bottom right have a bronze border - these are much more a case of 'best endeavours'. We have dashing screens up in our area now - it's enough to let you know there's an issue, and it can give a bit more granularity than the big flashing light thing … What dashing and SAWS don't give us, though, is any history. Meaning - we have no idea what happened when we weren't watching it. So another member of my team started working on something to give us that...
  60. Nagios chart gives us the last 24 hours history for each Nagios monitor. Means if we have intermittent errors that happen a lot, we don't miss them. And if something big happens when we're not there, we still know about it. So how does it work?
  61. It screenscrapes Nagios for status - this is what that information looks like on nagios. Nagios chart pings this regularly and keeps the information in memory for 24 hours (we go back that far as it lets us see what happened overnight, plus that was the limit before having to store it somewhere other than memory)
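Nagios chart is not open source, so the following is only a guess at the shape of the idea: poll for status, keep samples in memory, and drop anything older than 24 hours. The class and method names are hypothetical.

```python
from collections import deque

class StatusHistory:
    """In-memory rolling history of (timestamp, status) for one check."""

    def __init__(self, window_seconds=24 * 60 * 60):
        self.window_seconds = window_seconds
        self.samples = deque()

    def record(self, timestamp, status):
        """Store a new sample and evict anything older than the window."""
        self.samples.append((timestamp, status))
        cutoff = timestamp - self.window_seconds
        while self.samples and self.samples[0][0] < cutoff:
            self.samples.popleft()

    def statuses(self):
        """Statuses still inside the window, oldest first."""
        return [status for _, status in self.samples]
```

Because every sample is kept rather than only state changes, a check that flips between OK and CRITICAL overnight still shows up the next morning.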
  62. Each line is a service - in this case, it's all the services in Production on AWS for one of our teams. The name of the service, and of each check that failed, are shown on the left. The bars on the right show the status at any point. All failures are 'soft' failures - i.e. we don't wait for 3 failures to happen before indicating there was a problem. This allows us to see intermittent issues (but probably results in some noise). YELLOW: WARNING status - a minor failure, e.g. a check took slightly over the max time to respond. RED: CRITICAL status - a major failure, i.e. no response for a check. BLUE: ACKED state. So here you can see a large data load happening that put strain onto all our servers - they were in a flapping state for hours. At some point, people started acknowledging the alerts.
  63. This one is worse. We had major problems in our Test environment - our graph database fell over. Everything that had anything to do with graphs pretty much went down. As it's Test, there was less acknowledging going on.
  64. Here are two major problems, one after the other - the pink vertical lines show when nagios chart couldn't connect to nagios; this was down to packet loss on our network. The red bars were a firewall upgrade, eventually rolled back. Again, this is Test. Nagios chart works because it uses the human ability to make sense of patterns - we generally know when things are going wrong just out of the corner of our eye. If viewed in your browser, pixel mapping takes you RIGHT to the error in nagios. It's been successful - individual teams picked it up and it's been adopted by our platform and environments team, to make it available more generally at the FT. If it sounds interesting, let me know - it's not open sourced yet. … So the final comment on tools is about the tools you use for communication...
  65. That's probably a bit harsh.. But it's certainly not email for me. Even if you get the numbers down to a manageable level, threaded view isn't good for alerts - and it's hard to work out what they mean from this view (I realised after I took this screenshot that these aren't even alerts for my system - another team copied config and sent us all their alerts for a while) And we are moving away from email for team communication at the FT…
  66. We're using Slack a lot - most people have a Slack client open. Slack has great integration tools: webhooks let you call an http endpoint and post a message; email integration fits well with existing tools - anything that can send an email can send a Slack message. One of my colleagues tried to persuade me to set up a separate channel for our alerts, not using the main team channel. I think that's effectively saying "Put it somewhere where I can ignore it". If you are getting so many of these alerts that it's annoying, there are two things you can do: tune the alert (e.g. for API requests, alert on an increased number of failures in a ten minute period, so we tend to get this alert for real issues not network blips), or fix your broken system.
  67. One other thing I'm trying to persuade people to do is use Slack reactions to show that you've picked up an alert, and fixed it. I read that editorial teams are using Slack like this to move content through a workflow. We tend to react with a tick where we fixed something, and with 'eyes' if we're looking into it still. But the problem I have is the creativity of developers - I have to ask people what they mean by a dancing lady...
  68. If you put screens up that are clear in what they are showing, you'll notice when things go wrong. Non-developers on the team will also notice and tell you something's started flashing. Don't loop between screens - put something up that tells you what you need to know. Have more than one screen!
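One way to tune an alert along those lines is to fire only when the number of failures inside a ten minute window reaches a threshold. This is a minimal sketch; the threshold value and the class name are assumptions, not the FT's actual rule.

```python
from collections import deque

class FailureWindowAlert:
    """Fire only when enough failures land inside a sliding time window."""

    def __init__(self, threshold, window_seconds=600):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.failures = deque()

    def record_failure(self, timestamp):
        """Record a failure; return True if the alert should fire now."""
        self.failures.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Keep only failures strictly newer than the cutoff.
        while self.failures and self.failures[0] <= cutoff:
            self.failures.popleft()
        return len(self.failures) >= self.threshold
```

A single network blip never reaches the threshold, while a sustained run of failures trips it within the window.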
  69. You have to keep a focus on them or they start to get untidy
  70. Did you do something as a result of getting it? If no, delete it
  71. Language should be clear - avoid jargon. Get rid of typos. Link to useful documentation. Get your newest developer to read it. Get someone from another team to read it.
  72. This is text for an email alert based on looking at access log response times. First of all - what a really developer title for the alert: no spaces, categorised by how often it runs rather than what it means.
  73. Next up - it MIGHT result in articles not getting published. I want to know if it DID result in articles not getting published. Also - the business doesn't care about my Methode API microservice (which is a microservice wrapping calls over CORBA so that most people don't have to deal with CORBA). But our alerts also have a Technical impact section...
  74. I have no idea why we decided these and only these were the reasons for slow response times. It doesn't help me work out which of these is currently the issue.
  75. First of all - spaces in the title! This is better - at least I can tell that it's a publish failure, from our Methode CMS.
  76. And I can see which articles failed.
  77. And I can go and look at a run book for more information - in fact, the run book links to somewhere (actually a Jenkins job) where you can enter the list of UUIDs and kick off a republish process (yes, it could be automated, but sometimes you want to check it's not going to fail the second time, e.g. editors use their systems in ways we didn't predict). All of which makes it much less annoying to have to deal with an alert. This alert goes to some people in our editorial department, so they can check status and republish. So whenever you get an alert, really look at it.
  78. If someone had to come and tell you your system is broken, you probably need to find a way to know first the next time. Although… for some things, a slack channel that people know about is pretty good.
  79. Maybe you need to create a synthetic request, or add the right logs and create a Splunk alert
  80. Here - something that picked up when the percentage of failures increased told us we had a problem
  81. We've had a case where the integration that tells us when an article is published broke. Our monitoring starts from that notification. We found out via manual testing 3 days later. We asked the CMS team to add their own monitoring - but we also added a brute force test ourselves - "did we see any blog publishes in the last day?"
  82. I managed to turn off our publication failure alerts because I "improved" some logging. We worked this out when part of our data centre went down but we didn't see these alerts firing.
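The brute force test amounts to a staleness check: alert on the absence of an expected event, not only on errors. A minimal sketch, assuming the time of the most recent blog publish is available from somewhere (a log search or a database query, not shown here); the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

def publishes_are_stale(last_publish_at, now, max_age=timedelta(days=1)):
    """True when no publish has been seen within max_age.

    last_publish_at is the time of the most recent publish, or None if
    none has ever been seen.
    """
    if last_publish_at is None:
        return True
    return now - last_publish_at > max_age

now = datetime(2015, 11, 20, 9, 0, tzinfo=timezone.utc)
recent = publishes_are_stale(now - timedelta(hours=3), now)
silent_for_three_days = publishes_are_stale(now - timedelta(days=3), now)
```

This kind of check catches the case where the notification that normally starts the monitoring never arrives at all.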
  83. If your log entry is the basis for an alert, add a unit test that will fail if it's changed and explicitly says what the impact is
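A minimal sketch of such a test. The log message, the alert pattern, and the function name are all hypothetical; the point is that the production formatter and the alert's search pattern are tied together by a test whose failure message states the operational impact.

```python
import re
import unittest

# Hypothetical pattern that the log-based alert searches for.
ALERT_PATTERN = re.compile(r"Publish failed for uuid=(?P<uuid>[0-9a-f-]+)")

def publish_failure_message(uuid):
    """The production log line; the publication failure alert depends on it."""
    return f"Publish failed for uuid={uuid}"

class PublishFailureLogFormatTest(unittest.TestCase):
    def test_log_line_still_triggers_alert(self):
        uuid = "0a1b2c3d-0000-4000-8000-000000000000"
        match = ALERT_PATTERN.search(publish_failure_message(uuid))
        self.assertIsNotNone(
            match,
            "Changing this log line silently disables the publication "
            "failure alert. Update the alert search at the same time.",
        )
        self.assertEqual(match.group("uuid"), uuid)
```

Run it with `python -m unittest` alongside the rest of the service's tests, so a well-meant logging "improvement" fails the build instead of switching off an alert.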
  84. Maybe you take down one of your systems and check that you can tell the impact from the alerts you get. Maybe do this blind and see how quickly the Ops Cop can work out what's broken (we haven't done this but I'd like to). Or you can take part in company exercises - the FT took down one of our data centres earlier this year. We'd built up to it with smaller tests, did it on an agreed date, and made sure the right people were available. Crucially, every issue we found was worked through. We turned off a different data centre last weekend. For us as developers, a few weeks before, we started thinking about what might happen. We KNEW we didn't have resilience for one part of our system as part of a phased approach to delivery. However, when we started to think about what was going to happen, we found several unexpected reasons why we weren't going to have a working system (bad configuration, mostly) - we had those fixed before the day.
  85. Netflix have their Chaos Monkey for testing resilience by randomly killing instances and services (in fact they have an entire Simian Army to test resilience at different levels). The FT has its own Chaos Snail. If you're wondering why it's called that, it's smaller-scale than the chaos monkey, and it's written in shell. This runs on a virtual machine, kills processes as root, and records its work. It's a good way to see if your alerts are working.
  86. It needs to be at least as available as the system it's monitoring. This is something that took us a while to really get to grips with. But if the monitoring system is down, you have no idea what the state of your system is. … So that's it from me in terms of advice, so I guess the question is...
  87. Zero emails from Nagios - we have our inbox back! We rely on our other tools.
  88. We can't miss them. They are genuine alerts
  89. So we can see how we're doing on response times and error rates at any point
  90. There are lots of good reasons to do that. But realise what it means to support them.
  91. Think about it from the start. Make sure you have the right tools. Continue to cultivate your alerts.
  92. Our company page on Stack Overflow describes our technologies and the culture of our Technology department. We also have a technology blog where we talk about some of the things we're trying out.
  93. We have lots of our code on GitHub and are doing this more and more.