MILAN 20/21.11.2015
Alert overload: How to adopt a microservices architecture without being overwhelmed with noise
Sarah Wells - Financial Times
@sarahjwells
Microservices make it worse
microservices (n,pl): an efficient device for transforming business problems into distributed transaction problems
@drsnooks
You have a lot more systems
45 microservices
3 environments
2 instances for each service
20 checks per service
running every 5 minutes
> 1,500,000 system checks
per day
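The figure above follows directly from the numbers on the preceding lines. A minimal sketch of the arithmetic, with class and method names that are illustrative rather than taken from the talk:

```java
// Sketch only: reproduces the slide's arithmetic. Names are hypothetical.
final class CheckVolume {
    private static final int MINUTES_PER_DAY = 24 * 60;

    /** Number of individual check executions per day across the whole estate. */
    static long checksPerDay(int services, int environments, int instancesPerService,
                             int checksPerService, int intervalMinutes) {
        long checksPerRun =
                (long) services * environments * instancesPerService * checksPerService;
        long runsPerDay = MINUTES_PER_DAY / intervalMinutes;
        return checksPerRun * runsPerDay;
    }

    public static void main(String[] args) {
        // 45 services, 3 environments, 2 instances, 20 checks, every 5 minutes.
        System.out.println(checksPerDay(45, 3, 2, 20, 5)); // prints 1555200
    }
}
```

That is 5,400 checks per run and 288 runs per day, which is where "> 1,500,000" comes from.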
Over 19,000 system
monitoring alerts in 50 days
An average of 380 per day
Functional monitoring is also an issue
12,745 response time/error
alerts in 50 days
An average of 255 per day
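Both daily averages are simply the 50-day totals divided by 50. A small sketch of that calculation; the helper name is an assumption, and since the system monitoring total is given as "over 19,000", 380 is a lower bound:

```java
// Sketch only: the averaging behind the two "per day" figures.
final class AlertRate {
    /** Average alerts per day, rounded to the nearest whole alert. */
    static long averagePerDay(long alerts, int days) {
        return Math.round((double) alerts / days);
    }

    public static void main(String[] args) {
        System.out.println(averagePerDay(19_000, 50)); // system monitoring: prints 380
        System.out.println(averagePerDay(12_745, 50)); // response time/error: prints 255
    }
}
```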
Why so many?
http://devopsreactions.tumblr.com/post/122408751191/alerts-when-an-outage-starts
How can you make it better?
Quick starts: attack your problem
See our EngineRoom blog for more:
http://bit.ly/1PP7uQQ
1. Think about monitoring from the start
It's the business functionality you care about
We care about whether published content made it to us
When people call our APIs, we care about speed
… we also care about errors
But it's the end-to-end that matters
https://www.flickr.com/photos/robef/16537786315/
You only want an alert where you need to take action
If you just want information, create a dashboard or report
Make sure you can't miss an alert
Make the alert great
http://www.thestickerfactory.co.uk/
Build your system with support in mind
Transaction ids tie all microservices together
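One way to make a transaction id tie services together is for each service to reuse the id it receives and only mint a new one at the edge. The sketch below assumes the id travels in an `X-Request-Id` header and is logged as `transaction_id=...`; neither detail is stated on the slides, and the generated shape (`tid_` plus ten lowercase letters) is only modelled on the example id that appears in the alert later in the deck.

```java
import java.security.SecureRandom;
import java.util.Map;

// Sketch only: header name, log format and class name are assumptions.
final class TransactionIds {
    static final String HEADER = "X-Request-Id"; // hypothetical header name
    private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz";
    private static final SecureRandom RANDOM = new SecureRandom();

    /** Creates a new id shaped like "tid_" followed by ten lowercase letters. */
    static String newTransactionId() {
        StringBuilder sb = new StringBuilder("tid_");
        for (int i = 0; i < 10; i++) {
            sb.append(ALPHABET.charAt(RANDOM.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }

    /**
     * Reuses the caller's id when one was sent, so every hop logs the same value.
     * Simplification: the lookup is case-sensitive, unlike real HTTP headers.
     */
    static String resolve(Map<String, String> incomingHeaders) {
        String existing = incomingHeaders.get(HEADER);
        boolean missing = existing == null || existing.trim().isEmpty();
        return missing ? newTransactionId() : existing;
    }

    /** Prefixes a log message so the id is searchable in a log aggregator. */
    static String logLine(String transactionId, String message) {
        return "transaction_id=" + transactionId + " " + message;
    }
}
```

A service would call `resolve` on the way in, use `logLine` for everything it logs, and pass the same id on in the header of any outgoing request.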
Healthchecks tell you whether a service is OK
GET http://{service}/__health
returns 200 if the service can run the healthcheck
each check will return "ok": true or "ok": false
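The two rules above (always 200 if the healthcheck could run, with a per-check "ok" flag) can be sketched as follows. Only the "ok" field comes from the slides; the surrounding JSON layout, the "name" field and the class itself are assumptions, and check names are assumed to be plain identifiers that need no JSON escaping.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;
import java.util.stream.Collectors;

// Sketch only: a hand-rolled healthcheck body, not the FT implementation.
final class Healthcheck {
    /** The endpoint answers 200 whenever the healthcheck itself could run. */
    static final int STATUS_WHEN_RUN = 200;

    private final Map<String, BooleanSupplier> checks = new LinkedHashMap<>();

    Healthcheck register(String name, BooleanSupplier check) {
        checks.put(name, check);
        return this;
    }

    /** Runs every check and reports each one as "ok": true or "ok": false. */
    String body() {
        return checks.entrySet().stream()
                .map(e -> "{\"name\": \"" + e.getKey() + "\", \"ok\": "
                        + runSafely(e.getValue()) + "}")
                .collect(Collectors.joining(", ", "{\"checks\": [", "]}"));
    }

    /** A check that throws counts as failed rather than breaking the endpoint. */
    private static boolean runSafely(BooleanSupplier check) {
        try {
            return check.getAsBoolean();
        } catch (RuntimeException e) {
            return false;
        }
    }
}
```

Keeping the HTTP status at 200 means a monitoring tool has to read the body to find failing checks, but it can still tell "a dependency is unhealthy" apart from "the service is not answering at all".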
Synthetic requests tell you about problems early
https://www.flickr.com/photos/jted/5448635109
2. Use the right tools for the job
There are basic tools you need
FT Platform: An internal PaaS
Service monitoring (e.g. Nagios)
Log aggregation (e.g. Splunk)
Graphing (e.g. Graphite/Grafana)
metrics:
  reporters:
    - type: graphite
      frequency: 1 minute
      durationUnit: milliseconds
      rateUnit: seconds
      host: <%= @graphite.host %>
      port: 2003
      prefix: content.<%= @config_env %>.api-policy-component.<%= scope.lookupvar('::hostname') %>
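Port 2003 is the port Graphite conventionally uses for its plaintext protocol, where each data point is one line of the form `<path> <value> <epoch seconds>`. The sketch below shows what a data point under the configured prefix might look like on the wire; the metric name `requests.p95`, the environment and the hostname are made-up examples, and the reporter in the config does this formatting itself.

```java
// Sketch only: illustrates the line format, not the reporter's own code.
final class GraphiteLine {
    /** Mirrors the prefix template: content.<env>.api-policy-component.<hostname>. */
    static String prefix(String env, String hostname) {
        return "content." + env + ".api-policy-component." + hostname;
    }

    /** One line of the plaintext protocol, terminated by a newline. */
    static String format(String prefix, String metric, double value, long epochSeconds) {
        return prefix + "." + metric + " " + value + " " + epochSeconds + "\n";
    }

    public static void main(String[] args) {
        // Hypothetical environment, host and metric name.
        System.out.print(format(prefix("prod", "host01"), "requests.p95", 12.5, 1444635834L));
    }
}
```

Because the path is dot-separated, a hostname that itself contains dots would create extra levels in the metric tree.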
Real time error analysis (e.g. Sentry)
Build other tools to support you
SAWS
Built by Silvano Dossan
See our Engine room blog: http://bit.ly/1GATHLy
"I imagine most people do exactly
what I do - create a google filter to
send all Nagios emails straight to the
bin"
"Our screens have a viewing angle of
about 10 degrees"
"It never seems to show the page I
want"
Code at: https://github.com/muce/SAWS
Dashing
Nagios chart
Built by Simon Gibbs
@simonjgibbs
Use the right communication channel
It's not email
Slack integration
Radiators everywhere
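For the Slack integration, a common approach is an incoming webhook: an HTTP POST of a small JSON document with a `text` field. The slides do not say how the FT integration was built, so the sketch below is only one possible shape. The webhook URL is supplied by the caller, the message layout is an assumption, and the JSON escaping is deliberately minimal (backslash, double quote and newline only).

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch only: requires Java 11+ for java.net.http.
final class SlackAlert {
    /** Minimal JSON string escaping; other control characters are not handled. */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    /** Builds the webhook body: alert name first, then what it means for the business. */
    static String payload(String alertName, String businessImpact) {
        return "{\"text\": \"" + escape(alertName + ": " + businessImpact) + "\"}";
    }

    /** Posts the payload to the given webhook URL and returns the HTTP status code. */
    static int post(String webhookUrl, String payload)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(webhookUrl))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(payload))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.statusCode();
    }
}
```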
3. Cultivate your alerts
Review the alerts you get
If it isn't helpful, make sure you don't get sent it again
See if you can improve it
www.workcompass.com/
Splunk Alert: PROD - MethodeAPIResponseTime5MAlert
Business Impact
The methode api server is slow responding to requests.
This might result in articles not getting published to the new
content platform or publishing requests timing out.
...
…
Technical Impact
The server is experiencing service degradation because of
network latency, high publishing load, high bandwidth
utilization, excessive memory or cpu usage on the VM. This
might result in failure to publish articles to the new content
platform.
Splunk Alert: PROD Content Platform Ingester Methode
Publish Failures Alert
There has been one or more publish failures to the
Universal Publishing Platform. The UUIDs are listed below.
Please see the run book for more information.
_time                      transaction_id   uuid
Mon Oct 12 07:43:54 2015   tid_pbueyqnsqe   a56a2698-6e90-11e5-8608-a0853fb4e1fe
When you didn't get an alert
What would have told you about this?
Setting up an alert is part of fixing the problem
✔ code
✔ test
alerts
System boundaries are more difficult
Severin.stalder [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons
Make sure you would know if an alert stopped working
Add a unit test
public void shouldIncludeTriggerWordsForPublishFailureAlertInSplunk() {
…
}
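The body of the test above is elided on the slide. The idea it names can be sketched like this: if a Splunk alert is triggered by particular words in a log line, a test can pin those words so that a well-meaning change to the log message fails the build instead of silently disabling the alert. The trigger words, message format and class below are all hypothetical.

```java
import java.util.List;

// Sketch only: trigger words and log format are invented for illustration.
final class PublishFailureLog {
    /** Hypothetical phrases that the alert's saved search is assumed to match on. */
    static final List<String> TRIGGER_WORDS = List.of("Publish failed", "uuid=");

    /** The log message the service writes when a publish fails. */
    static String message(String transactionId, String uuid) {
        return "Publish failed transaction_id=" + transactionId + " uuid=" + uuid;
    }

    /** True when the log line still contains everything the alert searches for. */
    static boolean containsAllTriggerWords(String logLine) {
        return TRIGGER_WORDS.stream().allMatch(logLine::contains);
    }
}
```

A unit test would then assert `containsAllTriggerWords(message(...))`, and anyone rewording the log message gets a failing test that points at the alert they were about to break.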
Deliberately break things
Chaos snail
The thing that sends you alerts needs to be up and running
https://www.flickr.com/photos/davidmasters/2564786205/
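One way to notice that the alerting system itself has gone quiet is a heartbeat: the alerting system reports in regularly, and an independent watcher raises the alarm when the reports stop. The talk does not describe how this was done at the FT; the class below is a minimal sketch of the idea, with a hypothetical name and a caller-chosen silence threshold.

```java
import java.time.Duration;
import java.time.Instant;

// Sketch only: should run somewhere independent of the alerting system it watches.
final class AlertingHeartbeat {
    private final Duration maxSilence;
    private Instant lastBeat;

    AlertingHeartbeat(Duration maxSilence, Instant startedAt) {
        this.maxSilence = maxSilence;
        this.lastBeat = startedAt;
    }

    /** Called each time the alerting system proves it is alive. */
    void beat(Instant at) {
        lastBeat = at;
    }

    /** True when no heartbeat has been seen for longer than the allowed silence. */
    boolean isOverdue(Instant now) {
        return Duration.between(lastBeat, now).compareTo(maxSilence) > 0;
    }
}
```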
What's happened to our alerts?
We turned off ALL emails from system monitoring
Our two most important alerts come in via our team Slack channel
We have dashboards for our read APIs in Grafana
To summarise...
Build microservices
1. Think about monitoring from the start
2. Use the right tools for the job
3. Cultivate your alerts
About technology at the FT:
Look us up on Stack Overflow
http://bit.ly/1H3eXVe
Read our blog
http://engineroom.ft.com/
The FT on github
https://github.com/Financial-Times/
https://github.com/ftlabs
Thank you!
Questions?

More Related Content

What's hot

Cypress Tech Talk August 4 2015
Cypress Tech Talk August 4 2015Cypress Tech Talk August 4 2015
Cypress Tech Talk August 4 2015dczulada
 
MITRE ATT&CKcon 2018: Building an Atomic Testing Program, Brian Beyer, Red Ca...
MITRE ATT&CKcon 2018: Building an Atomic Testing Program, Brian Beyer, Red Ca...MITRE ATT&CKcon 2018: Building an Atomic Testing Program, Brian Beyer, Red Ca...
MITRE ATT&CKcon 2018: Building an Atomic Testing Program, Brian Beyer, Red Ca...MITRE - ATT&CKcon
 
SplunkSummit 2015 - Security Ninjitsu
SplunkSummit 2015 - Security NinjitsuSplunkSummit 2015 - Security Ninjitsu
SplunkSummit 2015 - Security NinjitsuSplunk
 
SplunkSummit 2015 - ES Hands On Workshop
SplunkSummit 2015 - ES Hands On Workshop SplunkSummit 2015 - ES Hands On Workshop
SplunkSummit 2015 - ES Hands On Workshop Splunk
 
SplunkSummit 2015 - Splunking the Endpoint
SplunkSummit 2015 - Splunking the EndpointSplunkSummit 2015 - Splunking the Endpoint
SplunkSummit 2015 - Splunking the EndpointSplunk
 
The present and future of Serverless observability (Serverless Computing London)
The present and future of Serverless observability (Serverless Computing London)The present and future of Serverless observability (Serverless Computing London)
The present and future of Serverless observability (Serverless Computing London)Yan Cui
 
Secure your Web Application With The New Python Audit Hooks
Secure your Web Application With The New Python Audit HooksSecure your Web Application With The New Python Audit Hooks
Secure your Web Application With The New Python Audit HooksNicolas Vivet
 
How to Monitoring the SRE Golden Signals (E-Book)
How to Monitoring the SRE Golden Signals (E-Book)How to Monitoring the SRE Golden Signals (E-Book)
How to Monitoring the SRE Golden Signals (E-Book)Siglos
 
Drupal Dev Days 2018 - Autopsy of Vulnerabilities
Drupal Dev Days 2018 - Autopsy of VulnerabilitiesDrupal Dev Days 2018 - Autopsy of Vulnerabilities
Drupal Dev Days 2018 - Autopsy of Vulnerabilitieszekivazquez
 
Attack-driven defense
Attack-driven defenseAttack-driven defense
Attack-driven defenseZane Lackey
 
DockerCon SF 2019 - Observability Workshop
DockerCon SF 2019 - Observability WorkshopDockerCon SF 2019 - Observability Workshop
DockerCon SF 2019 - Observability WorkshopKevin Crawley
 
Microservices 5 things i wish i'd known code motion
Microservices 5 things i wish i'd known   code motionMicroservices 5 things i wish i'd known   code motion
Microservices 5 things i wish i'd known code motionVincent Kok
 
Conf2014_SplunkSecurityNinjutsu
Conf2014_SplunkSecurityNinjutsuConf2014_SplunkSecurityNinjutsu
Conf2014_SplunkSecurityNinjutsuSplunk
 
Logging for Hackers v1.0
Logging for Hackers v1.0Logging for Hackers v1.0
Logging for Hackers v1.0Michael Gough
 
Incident Resolution as Code
Incident Resolution as CodeIncident Resolution as Code
Incident Resolution as CodeJulien Pivotto
 
MITRE ATT&CKcon 2018: VCAF: Expanding the ATT&CK Framework to cover VERIS Thr...
MITRE ATT&CKcon 2018: VCAF: Expanding the ATT&CK Framework to cover VERIS Thr...MITRE ATT&CKcon 2018: VCAF: Expanding the ATT&CK Framework to cover VERIS Thr...
MITRE ATT&CKcon 2018: VCAF: Expanding the ATT&CK Framework to cover VERIS Thr...MITRE - ATT&CKcon
 
Dev Talk: Event Manipulation and Testing
Dev Talk: Event Manipulation and TestingDev Talk: Event Manipulation and Testing
Dev Talk: Event Manipulation and TestingJason Stanley
 

What's hot (18)

Cypress Tech Talk August 4 2015
Cypress Tech Talk August 4 2015Cypress Tech Talk August 4 2015
Cypress Tech Talk August 4 2015
 
MITRE ATT&CKcon 2018: Building an Atomic Testing Program, Brian Beyer, Red Ca...
MITRE ATT&CKcon 2018: Building an Atomic Testing Program, Brian Beyer, Red Ca...MITRE ATT&CKcon 2018: Building an Atomic Testing Program, Brian Beyer, Red Ca...
MITRE ATT&CKcon 2018: Building an Atomic Testing Program, Brian Beyer, Red Ca...
 
SplunkSummit 2015 - Security Ninjitsu
SplunkSummit 2015 - Security NinjitsuSplunkSummit 2015 - Security Ninjitsu
SplunkSummit 2015 - Security Ninjitsu
 
SplunkSummit 2015 - ES Hands On Workshop
SplunkSummit 2015 - ES Hands On Workshop SplunkSummit 2015 - ES Hands On Workshop
SplunkSummit 2015 - ES Hands On Workshop
 
SplunkSummit 2015 - Splunking the Endpoint
SplunkSummit 2015 - Splunking the EndpointSplunkSummit 2015 - Splunking the Endpoint
SplunkSummit 2015 - Splunking the Endpoint
 
The present and future of Serverless observability (Serverless Computing London)
The present and future of Serverless observability (Serverless Computing London)The present and future of Serverless observability (Serverless Computing London)
The present and future of Serverless observability (Serverless Computing London)
 
Secure your Web Application With The New Python Audit Hooks
Secure your Web Application With The New Python Audit HooksSecure your Web Application With The New Python Audit Hooks
Secure your Web Application With The New Python Audit Hooks
 
How to Monitoring the SRE Golden Signals (E-Book)
How to Monitoring the SRE Golden Signals (E-Book)How to Monitoring the SRE Golden Signals (E-Book)
How to Monitoring the SRE Golden Signals (E-Book)
 
Drupal Dev Days 2018 - Autopsy of Vulnerabilities
Drupal Dev Days 2018 - Autopsy of VulnerabilitiesDrupal Dev Days 2018 - Autopsy of Vulnerabilities
Drupal Dev Days 2018 - Autopsy of Vulnerabilities
 
Attack-driven defense
Attack-driven defenseAttack-driven defense
Attack-driven defense
 
DockerCon SF 2019 - Observability Workshop
DockerCon SF 2019 - Observability WorkshopDockerCon SF 2019 - Observability Workshop
DockerCon SF 2019 - Observability Workshop
 
Microservices 5 things i wish i'd known code motion
Microservices 5 things i wish i'd known   code motionMicroservices 5 things i wish i'd known   code motion
Microservices 5 things i wish i'd known code motion
 
Conf2014_SplunkSecurityNinjutsu
Conf2014_SplunkSecurityNinjutsuConf2014_SplunkSecurityNinjutsu
Conf2014_SplunkSecurityNinjutsu
 
Logging for Hackers v1.0
Logging for Hackers v1.0Logging for Hackers v1.0
Logging for Hackers v1.0
 
Incident Resolution as Code
Incident Resolution as CodeIncident Resolution as Code
Incident Resolution as Code
 
Purple team is awesome
Purple team is awesomePurple team is awesome
Purple team is awesome
 
MITRE ATT&CKcon 2018: VCAF: Expanding the ATT&CK Framework to cover VERIS Thr...
MITRE ATT&CKcon 2018: VCAF: Expanding the ATT&CK Framework to cover VERIS Thr...MITRE ATT&CKcon 2018: VCAF: Expanding the ATT&CK Framework to cover VERIS Thr...
MITRE ATT&CKcon 2018: VCAF: Expanding the ATT&CK Framework to cover VERIS Thr...
 
Dev Talk: Event Manipulation and Testing
Dev Talk: Event Manipulation and TestingDev Talk: Event Manipulation and Testing
Dev Talk: Event Manipulation and Testing
 

Similar to Sarah Wells - Alert overload: How to adopt a microservices architecture without being overwhelmed with noise

Chaos Engineering: Why the World Needs More Resilient Systems
Chaos Engineering: Why the World Needs More Resilient SystemsChaos Engineering: Why the World Needs More Resilient Systems
Chaos Engineering: Why the World Needs More Resilient SystemsC4Media
 
An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)Brian Brazil
 
Prometheus - Open Source Forum Japan
Prometheus  - Open Source Forum JapanPrometheus  - Open Source Forum Japan
Prometheus - Open Source Forum JapanBrian Brazil
 
Analytics Driven SIEM Workshop
Analytics Driven SIEM WorkshopAnalytics Driven SIEM Workshop
Analytics Driven SIEM WorkshopSplunk
 
Are you ready for the next attack? Reviewing the SP Security Checklist
Are you ready for the next attack? Reviewing the SP Security ChecklistAre you ready for the next attack? Reviewing the SP Security Checklist
Are you ready for the next attack? Reviewing the SP Security ChecklistAPNIC
 
Are you ready for the next attack? reviewing the sp security checklist (apnic...
Are you ready for the next attack? reviewing the sp security checklist (apnic...Are you ready for the next attack? reviewing the sp security checklist (apnic...
Are you ready for the next attack? reviewing the sp security checklist (apnic...Barry Greene
 
Modern Web Security, Lazy but Mindful Like a Fox
Modern Web Security, Lazy but Mindful Like a FoxModern Web Security, Lazy but Mindful Like a Fox
Modern Web Security, Lazy but Mindful Like a FoxC4Media
 
Architecture: Manual vs. Automation
Architecture: Manual vs. AutomationArchitecture: Manual vs. Automation
Architecture: Manual vs. AutomationAmazon Web Services
 
Evolution of Monitoring and Prometheus (Dublin 2018)
Evolution of Monitoring and Prometheus (Dublin 2018)Evolution of Monitoring and Prometheus (Dublin 2018)
Evolution of Monitoring and Prometheus (Dublin 2018)Brian Brazil
 
Start Up Austin 2017: Manual vs Automation - When to Start Automating your Pr...
Start Up Austin 2017: Manual vs Automation - When to Start Automating your Pr...Start Up Austin 2017: Manual vs Automation - When to Start Automating your Pr...
Start Up Austin 2017: Manual vs Automation - When to Start Automating your Pr...Amazon Web Services
 
IRJET- A Study on Penetration Testing using Metasploit Framework
IRJET- A Study on Penetration Testing using Metasploit FrameworkIRJET- A Study on Penetration Testing using Metasploit Framework
IRJET- A Study on Penetration Testing using Metasploit FrameworkIRJET Journal
 
Penetration testing using metasploit framework
Penetration testing using metasploit frameworkPenetration testing using metasploit framework
Penetration testing using metasploit frameworkPawanKesharwani
 
What does "monitoring" mean? (FOSDEM 2017)
What does "monitoring" mean? (FOSDEM 2017)What does "monitoring" mean? (FOSDEM 2017)
What does "monitoring" mean? (FOSDEM 2017)Brian Brazil
 
SplunkLive! Splunk App for VMware
SplunkLive! Splunk App for VMwareSplunkLive! Splunk App for VMware
SplunkLive! Splunk App for VMwareSplunk
 
The Present and Future of Serverless Observability
The Present and Future of Serverless ObservabilityThe Present and Future of Serverless Observability
The Present and Future of Serverless ObservabilityC4Media
 
Integris Security - Hacking With Glue ℠
Integris Security - Hacking With Glue ℠Integris Security - Hacking With Glue ℠
Integris Security - Hacking With Glue ℠Integris Security LLC
 
Finding attacks with these 6 events
Finding attacks with these 6 eventsFinding attacks with these 6 events
Finding attacks with these 6 eventsMichael Gough
 
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...Brian Brazil
 

Similar to Sarah Wells - Alert overload: How to adopt a microservices architecture without being overwhelmed with noise (20)

Chaos Engineering: Why the World Needs More Resilient Systems
Chaos Engineering: Why the World Needs More Resilient SystemsChaos Engineering: Why the World Needs More Resilient Systems
Chaos Engineering: Why the World Needs More Resilient Systems
 
An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)
 
Prometheus - Open Source Forum Japan
Prometheus  - Open Source Forum JapanPrometheus  - Open Source Forum Japan
Prometheus - Open Source Forum Japan
 
Analytics Driven SIEM Workshop
Analytics Driven SIEM WorkshopAnalytics Driven SIEM Workshop
Analytics Driven SIEM Workshop
 
Are you ready for the next attack? Reviewing the SP Security Checklist
Are you ready for the next attack? Reviewing the SP Security ChecklistAre you ready for the next attack? Reviewing the SP Security Checklist
Are you ready for the next attack? Reviewing the SP Security Checklist
 
Are you ready for the next attack? reviewing the sp security checklist (apnic...
Are you ready for the next attack? reviewing the sp security checklist (apnic...Are you ready for the next attack? reviewing the sp security checklist (apnic...
Are you ready for the next attack? reviewing the sp security checklist (apnic...
 
Modern Web Security, Lazy but Mindful Like a Fox
Modern Web Security, Lazy but Mindful Like a FoxModern Web Security, Lazy but Mindful Like a Fox
Modern Web Security, Lazy but Mindful Like a Fox
 
Architecture: Manual vs. Automation
Architecture: Manual vs. AutomationArchitecture: Manual vs. Automation
Architecture: Manual vs. Automation
 
Evolution of Monitoring and Prometheus (Dublin 2018)
Evolution of Monitoring and Prometheus (Dublin 2018)Evolution of Monitoring and Prometheus (Dublin 2018)
Evolution of Monitoring and Prometheus (Dublin 2018)
 
Itech 1005
Itech 1005Itech 1005
Itech 1005
 
Start Up Austin 2017: Manual vs Automation - When to Start Automating your Pr...
Start Up Austin 2017: Manual vs Automation - When to Start Automating your Pr...Start Up Austin 2017: Manual vs Automation - When to Start Automating your Pr...
Start Up Austin 2017: Manual vs Automation - When to Start Automating your Pr...
 
Butler
ButlerButler
Butler
 
IRJET- A Study on Penetration Testing using Metasploit Framework
IRJET- A Study on Penetration Testing using Metasploit FrameworkIRJET- A Study on Penetration Testing using Metasploit Framework
IRJET- A Study on Penetration Testing using Metasploit Framework
 
Penetration testing using metasploit framework
Penetration testing using metasploit frameworkPenetration testing using metasploit framework
Penetration testing using metasploit framework
 
What does "monitoring" mean? (FOSDEM 2017)
What does "monitoring" mean? (FOSDEM 2017)What does "monitoring" mean? (FOSDEM 2017)
What does "monitoring" mean? (FOSDEM 2017)
 
SplunkLive! Splunk App for VMware
SplunkLive! Splunk App for VMwareSplunkLive! Splunk App for VMware
SplunkLive! Splunk App for VMware
 
The Present and Future of Serverless Observability
The Present and Future of Serverless ObservabilityThe Present and Future of Serverless Observability
The Present and Future of Serverless Observability
 
Integris Security - Hacking With Glue ℠
Integris Security - Hacking With Glue ℠Integris Security - Hacking With Glue ℠
Integris Security - Hacking With Glue ℠
 
Finding attacks with these 6 events
Finding attacks with these 6 eventsFinding attacks with these 6 events
Finding attacks with these 6 events
 
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
Monitoring What Matters: The Prometheus Approach to Whitebox Monitoring (Berl...
 

More from Codemotion

Fuzz-testing: A hacker's approach to making your code more secure | Pascal Ze...
Fuzz-testing: A hacker's approach to making your code more secure | Pascal Ze...Fuzz-testing: A hacker's approach to making your code more secure | Pascal Ze...
Fuzz-testing: A hacker's approach to making your code more secure | Pascal Ze...Codemotion
 
Pompili - From hero to_zero: The FatalNoise neverending story
Pompili - From hero to_zero: The FatalNoise neverending storyPompili - From hero to_zero: The FatalNoise neverending story
Pompili - From hero to_zero: The FatalNoise neverending storyCodemotion
 
Pastore - Commodore 65 - La storia
Pastore - Commodore 65 - La storiaPastore - Commodore 65 - La storia
Pastore - Commodore 65 - La storiaCodemotion
 
Pennisi - Essere Richard Altwasser
Pennisi - Essere Richard AltwasserPennisi - Essere Richard Altwasser
Pennisi - Essere Richard AltwasserCodemotion
 
Michel Schudel - Let's build a blockchain... in 40 minutes! - Codemotion Amst...
Michel Schudel - Let's build a blockchain... in 40 minutes! - Codemotion Amst...Michel Schudel - Let's build a blockchain... in 40 minutes! - Codemotion Amst...
Michel Schudel - Let's build a blockchain... in 40 minutes! - Codemotion Amst...Codemotion
 
Richard Süselbeck - Building your own ride share app - Codemotion Amsterdam 2019
Richard Süselbeck - Building your own ride share app - Codemotion Amsterdam 2019Richard Süselbeck - Building your own ride share app - Codemotion Amsterdam 2019
Richard Süselbeck - Building your own ride share app - Codemotion Amsterdam 2019Codemotion
 
Eward Driehuis - What we learned from 20.000 attacks - Codemotion Amsterdam 2019
Eward Driehuis - What we learned from 20.000 attacks - Codemotion Amsterdam 2019Eward Driehuis - What we learned from 20.000 attacks - Codemotion Amsterdam 2019
Eward Driehuis - What we learned from 20.000 attacks - Codemotion Amsterdam 2019Codemotion
 
Francesco Baldassarri - Deliver Data at Scale - Codemotion Amsterdam 2019 -
Francesco Baldassarri  - Deliver Data at Scale - Codemotion Amsterdam 2019 - Francesco Baldassarri  - Deliver Data at Scale - Codemotion Amsterdam 2019 -
Francesco Baldassarri - Deliver Data at Scale - Codemotion Amsterdam 2019 - Codemotion
 
Martin Förtsch, Thomas Endres - Stereoscopic Style Transfer AI - Codemotion A...
Martin Förtsch, Thomas Endres - Stereoscopic Style Transfer AI - Codemotion A...Martin Förtsch, Thomas Endres - Stereoscopic Style Transfer AI - Codemotion A...
Martin Förtsch, Thomas Endres - Stereoscopic Style Transfer AI - Codemotion A...Codemotion
 
Melanie Rieback, Klaus Kursawe - Blockchain Security: Melting the "Silver Bul...
Melanie Rieback, Klaus Kursawe - Blockchain Security: Melting the "Silver Bul...Melanie Rieback, Klaus Kursawe - Blockchain Security: Melting the "Silver Bul...
Melanie Rieback, Klaus Kursawe - Blockchain Security: Melting the "Silver Bul...Codemotion
 
Angelo van der Sijpt - How well do you know your network stack? - Codemotion ...
Angelo van der Sijpt - How well do you know your network stack? - Codemotion ...Angelo van der Sijpt - How well do you know your network stack? - Codemotion ...
Angelo van der Sijpt - How well do you know your network stack? - Codemotion ...Codemotion
 
Lars Wolff - Performance Testing for DevOps in the Cloud - Codemotion Amsterd...
Lars Wolff - Performance Testing for DevOps in the Cloud - Codemotion Amsterd...Lars Wolff - Performance Testing for DevOps in the Cloud - Codemotion Amsterd...
Lars Wolff - Performance Testing for DevOps in the Cloud - Codemotion Amsterd...Codemotion
 
Sascha Wolter - Conversational AI Demystified - Codemotion Amsterdam 2019
Sascha Wolter - Conversational AI Demystified - Codemotion Amsterdam 2019Sascha Wolter - Conversational AI Demystified - Codemotion Amsterdam 2019
Sascha Wolter - Conversational AI Demystified - Codemotion Amsterdam 2019Codemotion
 
Michele Tonutti - Scaling is caring - Codemotion Amsterdam 2019
Michele Tonutti - Scaling is caring - Codemotion Amsterdam 2019Michele Tonutti - Scaling is caring - Codemotion Amsterdam 2019
Michele Tonutti - Scaling is caring - Codemotion Amsterdam 2019Codemotion
 
Pat Hermens - From 100 to 1,000+ deployments a day - Codemotion Amsterdam 2019
Pat Hermens - From 100 to 1,000+ deployments a day - Codemotion Amsterdam 2019Pat Hermens - From 100 to 1,000+ deployments a day - Codemotion Amsterdam 2019
Pat Hermens - From 100 to 1,000+ deployments a day - Codemotion Amsterdam 2019Codemotion
 
James Birnie - Using Many Worlds of Compute Power with Quantum - Codemotion A...
James Birnie - Using Many Worlds of Compute Power with Quantum - Codemotion A...James Birnie - Using Many Worlds of Compute Power with Quantum - Codemotion A...
James Birnie - Using Many Worlds of Compute Power with Quantum - Codemotion A...Codemotion
 
Don Goodman-Wilson - Chinese food, motor scooters, and open source developmen...
Don Goodman-Wilson - Chinese food, motor scooters, and open source developmen...Don Goodman-Wilson - Chinese food, motor scooters, and open source developmen...
Don Goodman-Wilson - Chinese food, motor scooters, and open source developmen...Codemotion
 
Pieter Omvlee - The story behind Sketch - Codemotion Amsterdam 2019
Pieter Omvlee - The story behind Sketch - Codemotion Amsterdam 2019Pieter Omvlee - The story behind Sketch - Codemotion Amsterdam 2019
Pieter Omvlee - The story behind Sketch - Codemotion Amsterdam 2019Codemotion
 
Dave Farley - Taking Back “Software Engineering” - Codemotion Amsterdam 2019
Dave Farley - Taking Back “Software Engineering” - Codemotion Amsterdam 2019Dave Farley - Taking Back “Software Engineering” - Codemotion Amsterdam 2019
Dave Farley - Taking Back “Software Engineering” - Codemotion Amsterdam 2019Codemotion
 
Joshua Hoffman - Should the CTO be Coding? - Codemotion Amsterdam 2019
Joshua Hoffman - Should the CTO be Coding? - Codemotion Amsterdam 2019Joshua Hoffman - Should the CTO be Coding? - Codemotion Amsterdam 2019
Joshua Hoffman - Should the CTO be Coding? - Codemotion Amsterdam 2019Codemotion
 

More from Codemotion (20)

Fuzz-testing: A hacker's approach to making your code more secure | Pascal Ze...
Fuzz-testing: A hacker's approach to making your code more secure | Pascal Ze...Fuzz-testing: A hacker's approach to making your code more secure | Pascal Ze...
Fuzz-testing: A hacker's approach to making your code more secure | Pascal Ze...
 
Pompili - From hero to_zero: The FatalNoise neverending story
Pompili - From hero to_zero: The FatalNoise neverending storyPompili - From hero to_zero: The FatalNoise neverending story
Pompili - From hero to_zero: The FatalNoise neverending story
 
Pastore - Commodore 65 - La storia
Pastore - Commodore 65 - La storiaPastore - Commodore 65 - La storia
Pastore - Commodore 65 - La storia
 
Pennisi - Essere Richard Altwasser
Pennisi - Essere Richard AltwasserPennisi - Essere Richard Altwasser
Pennisi - Essere Richard Altwasser
 
Michel Schudel - Let's build a blockchain... in 40 minutes! - Codemotion Amst...
Michel Schudel - Let's build a blockchain... in 40 minutes! - Codemotion Amst...Michel Schudel - Let's build a blockchain... in 40 minutes! - Codemotion Amst...
Michel Schudel - Let's build a blockchain... in 40 minutes! - Codemotion Amst...
 
Richard Süselbeck - Building your own ride share app - Codemotion Amsterdam 2019
Richard Süselbeck - Building your own ride share app - Codemotion Amsterdam 2019Richard Süselbeck - Building your own ride share app - Codemotion Amsterdam 2019
Richard Süselbeck - Building your own ride share app - Codemotion Amsterdam 2019
 
Eward Driehuis - What we learned from 20.000 attacks - Codemotion Amsterdam 2019
Eward Driehuis - What we learned from 20.000 attacks - Codemotion Amsterdam 2019Eward Driehuis - What we learned from 20.000 attacks - Codemotion Amsterdam 2019
Eward Driehuis - What we learned from 20.000 attacks - Codemotion Amsterdam 2019
 
Francesco Baldassarri - Deliver Data at Scale - Codemotion Amsterdam 2019 -
Francesco Baldassarri  - Deliver Data at Scale - Codemotion Amsterdam 2019 - Francesco Baldassarri  - Deliver Data at Scale - Codemotion Amsterdam 2019 -
Francesco Baldassarri - Deliver Data at Scale - Codemotion Amsterdam 2019 -
 
Martin Förtsch, Thomas Endres - Stereoscopic Style Transfer AI - Codemotion A...
Martin Förtsch, Thomas Endres - Stereoscopic Style Transfer AI - Codemotion A...Martin Förtsch, Thomas Endres - Stereoscopic Style Transfer AI - Codemotion A...
Martin Förtsch, Thomas Endres - Stereoscopic Style Transfer AI - Codemotion A...
 
Melanie Rieback, Klaus Kursawe - Blockchain Security: Melting the "Silver Bul...
Melanie Rieback, Klaus Kursawe - Blockchain Security: Melting the "Silver Bul...Melanie Rieback, Klaus Kursawe - Blockchain Security: Melting the "Silver Bul...
Melanie Rieback, Klaus Kursawe - Blockchain Security: Melting the "Silver Bul...
 
Angelo van der Sijpt - How well do you know your network stack? - Codemotion ...

Sarah Wells - Alert overload: How to adopt a microservices architecture without being overwhelmed with noise

Editor's Notes

  1. Two years ago, I started working on a new project at the FT, rebuilding our content platform and APIs. We're using a microservice architecture. I'm here to talk about what it's like to move from monitoring a monolithic application to monitoring a whole lot of microservices. Which is also about what it's like to start doing devops, because when you are building new microservices whenever you need, and throwing them away when they stop being useful, you can't do a handover to a separate operations team each time: it takes too long. So you are going to be supporting your services, and the pain that used to be felt by operations when you didn't get monitoring and alerting right is now being felt by you…
  2. I'm guessing a lot of people in this room have been on a support mailing list at some point, so this probably looks familiar. Too many emails, and very hard to work out what they really mean. The bad news is ...
  3. I saw this recently and it made me laugh. BUT - there are lots of things I really like about microservices! It's easy to reason about the logic within a microservice, it's easier to deploy small changes both quickly and reversibly, it's easy to change your architecture, and once you have, it's easy to remove the code you don't need any more, because it's all in one service and you can check that nothing is calling it via the access logs for the service… So I don't want to go back to writing monolithic applications - but I do think that monitoring is harder for a microservice architecture. So why is that?
  4. Firstly, instead of 1 service, we have 45.
  5. We currently have Integration, Test and Production environments. There's some debate about whether we need three, and other teams at the FT only have production.
  6. We have at least 2 instances, for resilience, and sometimes more. And at the moment, each of those is on its own VM.
  7. These are system checks - disk space, CPU load, NTP, DNS.
  8. In fact, most of the checks run more often than every 5 minutes.
  9. Which means you get alerts for unlikely and transient issues all the time. Earlier this year, a new developer joined our team, and he couldn't believe the number of alert emails we were getting. He started counting.
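The arithmetic behind the "more than 1,500,000 system checks per day" figure in notes 4 to 9 can be reproduced in a few lines. This is only a sketch using the numbers quoted on the slides; it reads "20 checks per service" as 20 checks on each running instance, and uses the five-minute interval even though note 8 says most checks run more often, so the real total would be higher.

```python
# Figures quoted on the slides. The interval is the slowest case mentioned,
# so the result is a lower bound on the real number of checks.
SERVICES = 45
ENVIRONMENTS = 3
INSTANCES_PER_SERVICE = 2
CHECKS_PER_INSTANCE = 20  # assumption: "20 checks per service" applies to each instance
CHECK_INTERVAL_MINUTES = 5


def checks_per_day() -> int:
    """Total number of system check executions in 24 hours."""
    checks_per_run = (SERVICES * ENVIRONMENTS
                      * INSTANCES_PER_SERVICE * CHECKS_PER_INSTANCE)
    runs_per_day = (24 * 60) // CHECK_INTERVAL_MINUTES
    return checks_per_run * runs_per_day


print(checks_per_day())  # 1555200
```

Even a tiny false-positive rate applied to that many checks produces a steady stream of alerts for unlikely and transient issues.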
  10. And that's on average. When shared infrastructure goes wrong, for example if system time isn't being properly synchronised or someone accidentally switches off a DNS server, and you're monitoring it from every server, EVERYTHING lights up. As an example, we use Puppet to automate server setup and deployment - and we had 20,000 alert emails overnight for a PLANNED failover of our Puppet master. But it's not just system monitoring that is painful...
  11. We started out creating alerts and monitoring a lot like we did for monolithic applications: alerts based on response time, and alerts for ERROR logs or for responses with server error status codes - 500s, for example.
  12. First off, where in a monolith you were calling a function, now you're making an HTTP request, which means there are more things that can go wrong.
  13. If one thing fails...
  14. You'll get an alert from the service using it...
  15. But if you're naive in the way you set up alerts, you'll also get an alert from anything calling THAT service. Getting alerts from multiple services can also make it difficult to find the cause. And when things DO go wrong...
  16. This is what it feels like …
  17. You need to be able to support your system, which means you need to sort out your monitoring and alerting. ... It was clear this was causing us problems, especially when we looked at the numbers: with the system and functional monitoring alerts added together, that's one every 5 minutes. So, with the support of our Product Owner, we took some time to work on this.
  18. We have a thing we do at the FT called a Quickstart - we take a small team, maybe from several different projects or skillsets, and we put them in a room together. No specific requirements, no backlog - just a topic of interest. From feedback, it's apparently very important that free coffee and biscuits get delivered twice a day… In this case, we focussed on alerts: how to make them more useful and rescue our email inboxes. (There are more details on this on our technology blog, the Engine Room.)
  19. As a result of this I can tell you about three principles that helped us to reduce the number of alerts and spend less time responding to false alarms and confusing information.
  20. We got some things right, and I'll cover those later. What we got wrong is that we created far too many alerts without thinking about why we were doing it… it was just another thing on the checklist - create an alert. The problem is, you probably don't care about these alerts. I mean, how much do you care about NTP issues in non-production environments? But more importantly, you don't care about response times or errors where a service is just passing on what it got from lower down the stack.
  21. It's the business functionality you care about, not the individual microservice.
  22. For example, we are responsible for publishing FastFT posts - if that widget on the right of our site home page stops getting the latest updates, we will hear about it. So that's what our alerts should be focussing on. So to tell you what's important to us, I need to tell you a bit about our system...
  23. This is a logical view of the Universal Publishing Platform. Multiple source content management systems send us articles, blogs, images, videos etc. When content is published, it's transformed into a common format and annotated using a concept extraction pipeline. We also have metadata taxonomies like organisations, people and memberships, all loaded in. Then there are APIs to get content and metadata about content: articles about Apple -> information about Apple -> information about Tim Cook -> other companies he's involved with, etc. Architecturally, we have a mix of Go and Java/Dropwizard apps. We use Kafka to send messages about events. We have GraphDB and Mongo data stores. So what is our key business functionality?
  24. 1. Publishing and transforming content
  25. 2. Annotating that content - i.e. working out which companies an article mentions, or what person it's about
  26. 3. Loading updates of our data about organisations, people, etc
  27. 4. Making all that information available via APIs. But it's not the same things we care about for each...
  28. We want to know about every failure, because each failure is a story that our customers can't read yet. Our alert should make it clear we've failed to publish something, AND what needs to be done to fix it.
  29. For publication, there aren't that many events a day - maybe 600. We can look at individual events. For our APIs, we have 2.8 million requests a day at the moment, a little over 30 a second. So we look at 95th and 99th percentile response time, for example, to make sure they're ok. It doesn't have to be super fast, but it definitely can't be super slow. But we don't JUST care about speed...
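To make the percentile idea in note 29 concrete, here is a minimal Python sketch of checking 95th and 99th percentile response times against limits. The function names and the limits are hypothetical; the talk does not say how the FT computes percentiles, and this uses the simple nearest-rank definition.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def response_times_ok(samples_ms, p95_limit_ms, p99_limit_ms):
    """True when both tail percentiles are within their (assumed) limits."""
    return (percentile(samples_ms, 95) <= p95_limit_ms
            and percentile(samples_ms, 99) <= p99_limit_ms)
```

Looking at the tail rather than the mean fits the "doesn't have to be super fast, but can't be super slow" goal: a handful of very slow requests moves the 99th percentile long before it moves the average.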
  30. i.e. did something go wrong. The obvious thing to look for is server errors - something has gone wrong somewhere in our stack. The graph here shows when some of our blades failed in a data centre. This is for some business functionality that's not critical at the moment, so we are comfortable with all the nodes being in the same data centre, in case you're wondering why a blade failure would break things! The sudden increase in 500 errors triggered our alerts, so we knew about this really quickly. However, we also look for client errors: a sudden increase in 400 errors, i.e. bad requests, could be your fault. We've made changes that turned out to break our API contract - e.g. POST requests suddenly needed to have the Content-Type header application/json. That meets the HTTP spec, but is less lenient, and so BAD. We would want an alert for that.
  31. We have built in back off and retry for recoverable errors. Sometimes the first request fails, and the second one succeeds. We don't want an alert in that case. We might want a report, so we know we have a flaky connection. Or we might just accept that our network is evil.
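The back off and retry behaviour in note 31 might look something like the following Python sketch. Everything here is illustrative: `RecoverableError`, `call_with_retry` and the delay values are hypothetical names and numbers, not the FT's implementation. The point is that a retry that succeeds is only reported (via `on_retry`), and an exception, which is what an alert would hang off, escapes only when every attempt has failed.

```python
import time


class RecoverableError(Exception):
    """An error worth retrying, e.g. a dropped connection (hypothetical)."""


def call_with_retry(operation, attempts=3, base_delay=0.5,
                    sleep=time.sleep, on_retry=None):
    """Run operation, retrying recoverable errors with exponential back off."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except RecoverableError as err:
            if attempt == attempts:
                raise  # out of attempts: this is the case that deserves an alert
            if on_retry is not None:
                on_retry(attempt, err)  # feed a report or dashboard, not an alert
            sleep(base_delay * 2 ** (attempt - 1))
```

Counting the `on_retry` calls gives the "report" mentioned in the note: a way to see that a connection is flaky without interrupting anyone.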
  32. Otherwise, it's just noise. Your alerts should be something you don't mind being interrupted about.
  33. You can go and look at it whenever you want. I bet you won't look at it as often as you think you will. We got rid of all our publish microservice-specific response time alerts and all our microservice-specific error alerts, and made the most interesting ones into real-time dashboards.
  34. Now that your alerts really mean you need to react, make them unmissable. This means they need to attract the attention of the people that need to react. How you do that depends on your team and your working practices. We have an 'Ops Cop', and take it in turns to do that role for a week. The Ops Cop will also take on small pieces of work, tidying up, refactoring - things that don't need you to be in flow (because you WILL get interrupted).
  35. Anyone reading the alert should be able to work out: what it actually means; the action they need to take; who to talk to if they get stuck. Use clear language and don't be vague. Add a link to explanatory information (panic guide) - this needs to be clear too, and needs to be reviewed by someone who may have to use it but didn't write the service (e.g. new team members who've never had to look at this service/operations). Consider how to make "future you's" life easier: here's a search link to show you the whole transaction; here's a Jenkins job to republish.
  36. Our transaction IDs are added to logs using MDC (Mapped Diagnostic Context). Every microservice we write needs to check for a special X-Request-Id header (we do this via a Servlet Filter), then add it to the thread context. Any requests over HTTP must pass on the X-Request-Id header too.
  37. This means all logs for a particular user request will have a unique identifier logged, and we can look at everything that happened when an article was published or a read request was made.
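The talk describes doing this in Java with MDC and a Servlet Filter. As a language-neutral illustration of the same pattern, here is a Python sketch using `contextvars` and a `logging.Filter`. The function names are hypothetical, and generating a fresh id when the header is missing is an assumption: the notes only say the header is checked for and passed on.

```python
import contextvars
import logging
import uuid

# Holds the transaction id for the request currently being handled.
_transaction_id = contextvars.ContextVar("transaction_id", default="-")


class TransactionIdFilter(logging.Filter):
    """Copies the current transaction id onto every log record."""

    def filter(self, record):
        record.transaction_id = _transaction_id.get()
        return True


def begin_request(headers):
    """Pick up X-Request-Id from an incoming request (exact-case lookup for brevity)."""
    # Assumption: mint a new id if the caller did not send one.
    tid = headers.get("X-Request-Id") or f"generated-{uuid.uuid4()}"
    _transaction_id.set(tid)
    return tid


def outgoing_headers(extra=None):
    """Headers for a downstream call, always passing the transaction id on."""
    headers = dict(extra or {})
    headers["X-Request-Id"] = _transaction_id.get()
    return headers
```

With the filter attached to a handler and `%(transaction_id)s` in the log format, searching the aggregated logs for one id shows every service that touched that request.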
  38. We have an FT standard for healthchecks - you must return a particular JSON response on a particular endpoint.
  39. You return 200 for unhealthy as well - there was some debate about this; the logic is that a 500 indicates that the healthcheck can't be run, which is different from it failing.
  40. You have to look at each check to work out whether you have any failing checks.
  41. This is what the JSON looks like.
  42. There's a Chrome plugin to make it look nicer for humans.
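Because a 200 response only means the healthcheck could be run, a consumer has to inspect every individual check. The Python sketch below shows that logic. The JSON shape (a `checks` list whose entries have `name` and `ok` fields) is an assumption made for illustration, since the actual FT format shown on the slide is not reproduced in these notes.

```python
import json


class HealthcheckUnavailable(Exception):
    """The healthcheck itself could not be run (non-200 response)."""


def failing_checks(status_code, body):
    """Return the names of failing checks from a healthcheck response."""
    if status_code != 200:
        # A non-200 means "could not run the checks", not "a check failed".
        raise HealthcheckUnavailable(f"healthcheck returned HTTP {status_code}")
    document = json.loads(body)
    # Assumed shape: {"checks": [{"name": "...", "ok": true}, ...]}
    return [check["name"] for check in document["checks"] if not check["ok"]]
```

Keeping the two outcomes separate lets monitoring distinguish "the service is up but a dependency is unhealthy" from "the service cannot even answer".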
  43. You want to know about problems before they affect your customers, if possible. We started off with synthetic publication requests. Synthetic publication takes a known, old article, and publishes it every minute. If this breaks, we can fix it before a single real publish fails.
  44. By basic, I mean standard.
  45. A Puppet-based framework. Goal: for developers to reliably build & deploy services* from "zero-to-customer" in less than 15 mins... across data centres, with monitoring... It supports multiple IaaS providers. Digression: there's some debate about FT Platform internally, and some teams aren't using it; they use Heroku or 'naked' AWS. Personal opinion: it bootstrapped this type of deployment at the FT, and at the time most developers weren't that familiar with the underlying tools, but if you are already familiar with Heroku and AWS, it can feel like you're being restricted. We're now evolving FT Platform to reflect that, with a move to CloudFormation and an internal tool called Konstructor that provides an API wrapper round a lot of our other tools. However: it gave us monitoring and log aggregation for any new microservice with no additional effort.
  46. Nagios monitors system metrics, network protocols, applications, services, servers, and network infrastructure. It alerts via email or (god forbid) SMS when there are failures and when the service recovers. You can acknowledge alerts to stop the notifications, and put services into maintenance mode for known downtimes.
  47. Every VM set up using FT Platform automatically forwards logs to Splunk. Any query you want to run across all hosts in a service, or across all services that take part in a particular event, is easy to do without having to jump onto the relevant box. We use it to identify problems and alert, to visualise performance or load, and to create dashboards for particular services. But more recently, we're moving away from Splunk dashboards...
  48. And instead we're graphing our metrics using Graphite and Grafana. We're using Dropwizard for our Java apps and that comes with codahale metrics embedded. It's a small config change to write those metrics to a graphite server...
  49. Graphite isn't particularly pretty - you can see all the metrics and compose graphs on the fly... ... but by using Grafana on top of it you can easily create beautiful custom dashboards...
  50. They're quick to load as well. This shows one of our Read API components, so we're interested in server errors, client errors and successful requests, and also request rate across hosts. Interesting - here, the traffic started to switch over from one data centre to another, I have no idea why!
  51. We were using Splunk to pick up ERROR level logs. The problem there is that every ERROR results in an alert. You might be more hardcore about this than me, but unless you have zero tolerance of ERROR logs, there will be times when there are some errors that aren't a priority - they don't represent a major issue and there aren't that many. We got some of those from the client we use to talk to Kafka. We were ignoring them, and missed a problem someone introduced that also caused ERROR logs. That wouldn't happen in Sentry or equivalent tools, because each new error TYPE results in an alert. Again, sending information from a Dropwizard app to Sentry is a simple configuration change to send logs out to the Sentry API. OK, so those are the basic tools...
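The difference between "alert on every ERROR" and "alert on each new error TYPE" can be shown with a small Python sketch. This is not how Sentry groups events (its real fingerprinting is more sophisticated); the class name and the choice of logger name plus exception class as the fingerprint are assumptions for illustration only.

```python
class NewErrorTypeAlerter:
    """Alerts the first time a given kind of error is seen, then stays quiet."""

    def __init__(self):
        self._seen = set()

    def should_alert(self, logger_name, error):
        # Assumed fingerprint: where the error was logged plus its class.
        fingerprint = (logger_name, type(error).__name__)
        if fingerprint in self._seen:
            return False
        self._seen.add(fingerprint)
        return True
```

With this approach the known, low-priority Kafka client errors would alert once and then go quiet, while a newly introduced problem would still stand out.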
  52. If the basic tools aren't giving you what you need, build your own. This is easier if those basic tools have good APIs - because you can create your own view easily. Our first 'extra' tool was created by one of our integration engineers - he turned up with it one day…
  53. SAWS. Built using Blinky tape - a programmable LED strip. Each section represents a different part of our system. Things light up when there's a problem, and when there isn't a problem, the blue lights swoosh back and forth so you know the monitoring is still running. It used to be really cool and run on a Raspberry Pi - it's a Python script - but that broke and now it runs on an old Windows box under someone's desk. So why did Silvano create this? First off, frustrations with the number of emails...
  54. Which he was sending straight to the bin...
  55. And secondly, frustration with monitoring screens.
  56. He wanted something that made it easy to see instantly if there was a problem.
  57. This is SAWS up in our office. It's pretty simple - red indicates something bad has happened. He also changed from green to blue after this, to make sure everyone can see if there's a problem… It's not really this bright :) So that was our first tool. Our second tool addresses the problem of waiting for screens to cycle through to see the one you want to see - by providing a single screen that can tell you what you need to know...
  58. Dashing is a Sinatra-based framework that lets you build beautiful dashboards. It was originally built by Shopify for showing things on monitors around the office. It's been adopted by the FT - lots of things we care about are very easy to add as tiles: Nagios (monitoring), Jenkins (build and deployment), Pingdom (website monitoring). And it's not hard to add a new widget to integrate another system. This is the customised dashboard for our system. We have tiles for our Nagios monitors, and for particular Jenkins jobs - the ones with the dial.
  60. Nagios chart gives us the last 24 hours of history for each Nagios monitor. This means that if we have intermittent errors that happen a lot, we don't miss them. And if something big happens when we're not there, we still know about it. So how does it work?
  61. It screenscrapes Nagios for status - this is what that information looks like in Nagios. Nagios chart pings this regularly and keeps the information in memory for 24 hours (we go back that far as it lets us see what happened overnight, plus that was the limit before having to store it somewhere other than memory).
  62. Each line is a service - in this case, it's all the services in Production on AWS for one of our teams. The name of the service, and of each check that failed, are shown on the left. The bars on the right show the status at any point. All failures are 'soft' failures - i.e. we don't wait for 3 failures to happen before indicating there was a problem. This allows us to see intermittent issues (but probably results in some noise). YELLOW: WARNING status - a minor failure - e.g. a check took slightly over the max time to respond. RED: CRITICAL status - a major failure, i.e. no response for a check. BLUE: ACKED state. So here you can see a large data load happening that put strain onto all our servers - they were in a flapping state for hours. At some point, people started acknowledging the alerts.
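The "poll regularly, keep 24 hours in memory" approach described in notes 61 and 62 could be sketched in Python as below. Nagios chart is not open source, so this is not its code: the class, its method names and the use of plain status strings are all hypothetical.

```python
from collections import deque


class StatusHistory:
    """Keeps recent status samples in memory and drops anything too old."""

    def __init__(self, retention_seconds=24 * 60 * 60):
        self.retention_seconds = retention_seconds
        self._samples = deque()  # (timestamp, service, status), oldest first

    def record(self, timestamp, service, status):
        """Store one polled status, e.g. "OK", "WARNING", "CRITICAL" or "ACKED"."""
        self._samples.append((timestamp, service, status))
        cutoff = timestamp - self.retention_seconds
        while self._samples and self._samples[0][0] < cutoff:
            self._samples.popleft()

    def timeline(self, service):
        """All retained (timestamp, status) samples for one service, oldest first."""
        return [(ts, status) for ts, svc, status in self._samples if svc == service]
```

Recording every sample, rather than waiting for several consecutive failures, is what makes intermittent problems visible as a pattern on the chart.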
  63. This one is worse. We had major problems in our Test environment - our graph database fell over. Everything that had anything to do with graphs pretty much went down. As it's Test, there was less acknowledging going on.
  64. Here are two major problems, one after the other - the pink vertical lines show when Nagios chart couldn't connect to Nagios; this was down to packet loss on our network. The red bars were a firewall upgrade, eventually rolled back. Again, this is Test. Nagios chart works because it uses the human ability to make sense of patterns - we generally know when things are going wrong just out of the corner of our eye. If viewed in your browser, pixel mapping takes you RIGHT to the error in Nagios. It's been successful - individual teams picked it up and it's been adopted by our platform and environments team, to make it available more generally at the FT. If it sounds interesting, let me know - it's not open sourced yet. … So the final comment on tools is about the tools you use for communication...
  65. That's probably a bit harsh... But it's certainly not email for me. Even if you get the numbers down to a manageable level, threaded view isn't good for alerts - and it's hard to work out what they mean from this view. (I realised after I took this screenshot that these aren't even alerts for my system - another team copied config and sent us all their alerts for a while.) And we are moving away from email for team communication at the FT…
  66. We're using Slack a lot - most people have a Slack client open. Slack has great integration tools: webhooks let you call an HTTP endpoint and post a message, and email integration fits well with existing tools - anything that can send an email can send a Slack message. One of my colleagues tried to persuade me to set up a separate channel for our alerts, not using the main team channel. I think that's effectively saying "Put it somewhere where I can ignore it". If you are getting so many of these alerts that it's annoying, there are two things you can do: tune the alert (e.g. for API requests, we alert on an increased number of failures in a ten minute period, so we tend to get this alert for real issues, not network blips), or fix your broken system.
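One way to read "increased number of failures in a ten minute period" is as a sliding-window failure rate. The Python sketch below shows that idea; the class name, the 5% threshold and the minimum request count are assumptions picked for illustration, not the FT's actual alert settings.

```python
from collections import deque


class FailureRateAlert:
    """Fires when the failure ratio over a sliding time window is too high."""

    def __init__(self, window_seconds=600, max_failure_ratio=0.05, min_requests=20):
        self.window_seconds = window_seconds
        self.max_failure_ratio = max_failure_ratio
        self.min_requests = min_requests  # avoid firing on a handful of requests
        self._events = deque()  # (timestamp, failed), oldest first

    def record(self, timestamp, failed):
        """Record one request outcome and report whether the alert should fire."""
        self._events.append((timestamp, failed))
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()
        total = len(self._events)
        failures = sum(1 for _, was_failure in self._events if was_failure)
        return (total >= self.min_requests
                and failures / total > self.max_failure_ratio)
```

A single failed request during a network blip leaves the ratio low and stays silent, while a sustained run of failures pushes it over the threshold.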
  67. One other thing I'm trying to persuade people to do is use Slack reactions to show that you've picked up an alert, and fixed it. I read that editorial teams are using Slack like this to move content through a workflow. We tend to react with a tick where we fixed something, and with 'eyes' if we're still looking into it. But the problem I have is the creativity of developers - I have to ask people what they mean by a dancing lady...
  68. If you put screens up that are clear in what they are showing, you'll notice when things go wrong. Non-developers on the team will also notice and tell you something's started flashing. Don't loop between screens - put something up that tells you what you need to know. Have more than one screen!
  69. You have to keep a focus on them or they start to get untidy.
  70. Did you do something as a result of getting it? If no, delete it.
  71. Language should be clear - avoid jargon. Get rid of typos. Link to useful documentation. Get your newest developer to read it. Get someone from another team to read it.
  72. This is text for an email alert based on looking at access log response times. First of all - what a really developer-ish title for the alert: no spaces, categorised by how often it runs rather than what it means.
  73. Next up - it MIGHT result in articles not getting published. I want to know if it DID result in articles not getting published. Also - the business doesn't care about my Methode API microservice (which is a microservice wrapping calls over CORBA so that most people don't have to deal with CORBA). But our alerts also have a Technical impact section...
  74. I have no idea why we decided these and only these were the reasons for slow response times. It doesn't help me work out which of these is currently the issue.
  75. First of all - spaces in the title! This is better - at least I can tell that it's a publish failure, from our Methode CMS.
  76. And I can see which articles failed.
  77. And I can go and look at a run book for more information - in fact, the run book links to somewhere (actually a Jenkins job) where you can enter the list of UUIDs and kick off a republish process. (Yes, this could be automated, but sometimes you want to check it's not going to fail the second time, e.g. editors use their systems in ways we didn't predict.) All of which makes it much less annoying to have to deal with an alert. This alert goes to some people in our editorial department, so they can check status and republish. So whenever you get an alert, really look at it.
  78. If someone had to come and tell you your system is broken, you probably need to find a way to know first the next time. Although… for some things, a Slack channel that people know about is pretty good.
  79. Maybe you need to create a synthetic request, or add the right logs and create a Splunk alert
  80. Here - something that picked up when the percentage of failures increased told us we had a problem
  81. We've had a case where the integration that tells us when an article is published broke. Our monitoring starts from that notification. We found out via manual testing 3 days later. We asked the CMS team to add their own monitoring - but we also added a brute force test ourselves - "did we see any blog publishes in the last day?"
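The brute force test in note 81 can be sketched in a few lines of Python. The function name, its parameters and the one-day default are assumptions for illustration; where the publish timestamps come from (logs, a database, an API) is left open.

```python
from datetime import datetime, timedelta


def publishes_seen_recently(publish_times, now, max_gap=timedelta(days=1)):
    """Return True if at least one publish happened within max_gap before now."""
    cutoff = now - max_gap
    return any(cutoff <= t <= now for t in publish_times)


# Hypothetical usage: alert if no blog publish was seen in the last day.
now = datetime(2015, 11, 20, 12, 0)
recent = [datetime(2015, 11, 20, 9, 30)]
stale = [datetime(2015, 11, 17, 9, 30)]
if not publishes_seen_recently(stale, now):
    print("No blog publishes seen in the last day - check the integration")
```

A check like this knows nothing about why publishes stopped, which is the point: it still fires when the notification it would normally depend on is the thing that broke.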
  82. I managed to turn off our publication failure alerts because I "improved" some logging. We worked this out when part of our data centre went down but we didn't see these alerts firing.
  83. If your log entry is the basis for an alert, add a unit test that will fail if it's changed and explicitly says what the impact is
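One way to do what note 83 suggests, sketched in Python with the standard unittest module. The log format, the pattern and the function name are all hypothetical; the idea is that the test pins the log entry to the pattern the alert searches for, and the failure message spells out the impact.

```python
import re
import unittest

# Hypothetical pattern that the alert's saved search matches on.
ALERT_PATTERN = re.compile(r"event=PublishFailed uuid=\S+")


def publish_failure_log_line(uuid):
    """Format the log entry that the publication failure alert depends on."""
    return f"event=PublishFailed uuid={uuid}"


class PublishFailureLogTest(unittest.TestCase):
    def test_log_line_still_matches_alert(self):
        line = publish_failure_log_line("hypothetical-uuid-123")
        self.assertRegex(
            line,
            ALERT_PATTERN,
            "Changing this log entry will silently disable the "
            "publication failure alert. Update the alert before changing it.",
        )
```

Anyone who "improves" the log line then gets a failing build with an explanation, rather than finding out when alerts fail to fire during an outage.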
  84. Maybe you take down one of your systems and check that you can tell the impact from the alerts you get. Maybe do this blind and see how quickly the Ops Cop can work out what's broken (we haven't done this but I'd like to). Or you can take part in company exercises - the FT took down one of our data centres earlier this year. We'd built up to it with smaller tests, did it on an agreed date, and made sure the right people were available. Crucially, every issue we found was worked through. We turned off a different data centre last weekend. As developers, we started thinking a few weeks beforehand about what might happen. We KNEW we didn't have resilience for one part of our system as part of a phased approach to delivery. However, when we started to think about what was going to happen, we found several unexpected reasons why we weren't going to have a working system (bad configuration, mostly) - we had those fixed before the day.
  85. Netflix have their Chaos Monkey for testing resilience by randomly killing instances and services (in fact they have an entire Simian Army to test resilience at different levels). The FT has its own Chaos Snail. If you're wondering why it's called that, it's smaller-scale than the Chaos Monkey, and it's written in shell. This runs on a virtual machine, kills processes as root, and records its work. It's a good way to see if your alerts are working.
  86. It needs to be at least as available as the system it's monitoring. This is something that took us a while to really get to grips with. But if the monitoring system is down, you have no idea what the state of your system is. … So that's it from me in terms of advice, so I guess the question is...
  87. Zero emails from Nagios - we have our inbox back! We rely on our other tools
  88. We can't miss them. They are genuine alerts
  89. So we can see how we're doing on response times and error rates at any point
  90. There are lots of good reasons to do that. But realise what it means to support them.
  91. Think about it from the start. Make sure you have the right tools. Continue to cultivate your alerts.
  92. Our company page on Stack Overflow describes our technologies and the culture of our Technology department. We also have a technology blog where we talk about some of the things we're trying out.
  93. We have lots of our code on GitHub and are doing this more and more.