You've heard all about what microservices can do for you. You're convinced. So you build some. Reasoning about your functionality is way easier: these services are so simple! Then you get to the point where you have 35 microservices, in three data centres, and all the monitoring and alerting tactics you used for your monoliths are a complete disaster. You can't pick out the important stuff and your inbox is unusable. Something needs to change, and this talk will explain what and how.
Sarah Wells - Alert overload: How to adopt a microservices architecture without being overwhelmed with noise
1. MILAN 20/21.11.2015
Alert overload: How to adopt a microservices architecture without being overwhelmed with noise
Sarah Wells - Financial Times
@sarahjwells
46. Healthchecks tell you whether a service is OK
GET http://{service}/__health
returns 200 if the service can run the healthcheck
each check will return "ok": true or "ok": false
50. Synthetic requests tell you about problems early
https://www.flickr.com/photos/jted/5448635109
82. See if you can improve it
www.workcompass.com/
83. Splunk Alert: PROD - MethodeAPIResponseTime5MAlert
Business Impact
The methode api server is slow responding to requests. This might result in articles not getting published to the new content platform or publishing requests timing out.
...
85. …
Technical Impact
The server is experiencing service degradation because of network latency, high publishing load, high bandwidth utilization, excessive memory or cpu usage on the VM. This might result in failure to publish articles to the new content platform.
86. Splunk Alert: PROD Content Platform Ingester Methode Publish Failures Alert
There has been one or more publish failures to the Universal Publishing Platform. The UUIDs are listed below. Please see the run book for more information.
_time transaction_id uuid
Mon Oct 12 07:43:54 2015 tid_pbueyqnsqe a56a2698-6e90-11e5-8608-a0853fb4e1fe
Two years ago, I started working on a new project at the FT, rebuilding our content platform and APIs. We're using a microservice architecture.
I'm here to talk about what it's like to move from monitoring a monolithic application to monitoring a whole lot of microservices.
Which is also about what it's like to start doing devops, because when you are building new microservices whenever you need, and throwing them away when they stop being useful, you can't do a handover to a separate operations team each time: it takes too long.
So you are going to be supporting your services, and the pain that used to be felt by operations when you didn't get monitoring and alerting right is now being felt by you…
I'm guessing a lot of people in this room have been on a support mailing list at some point, so this probably looks familiar.
Too many emails, and very hard to work out what they really mean.
The bad news is ...
I saw this recently and it made me laugh.
BUT - there are lots of things I really like about microservices!
It's easy to reason about the logic within a microservice
it's easier to deploy small changes both quickly and reversibly,
it's easy to change your architecture, and once you have,
it's easy to remove the code you don't need any more, because it's all in one service and you can check that nothing is calling it via the access logs for the service…
So I don't want to go back to writing monolithic applications - but I do think that monitoring is harder for a microservice architecture.
So why is that?
Firstly, instead of 1 service, we have 45
We currently have Integration, Test and Production environments.
There's some debate about whether we need three and other teams at the FT only have production
We have at least 2 instances, for resilience, and sometimes more.
And at the moment, each of those is on its own VM
These are system checks - disk space, CPU load, NTP, DNS
Most of the checks run more often than every 5 minutes in fact
Which means you get alerts for unlikely and transient issues all the time.
Earlier this year, a new developer joined our team, and he couldn't believe the number of alert emails we were getting. He started counting.
And that's on average.
When shared infrastructure goes wrong - for example if system time isn't being properly synchronised, or someone accidentally switches off a DNS server - and you're monitoring it from every server, EVERYTHING lights up
As an example, we use puppet to automate server setup and deployment - and we had 20000 alert emails overnight for a PLANNED failover of our puppet master
But it's not just system monitoring that is painful...
We started out creating alerts and monitoring a lot like we did for monolithic applications:
alerts based on response time
alerts for ERROR logs or responses that are server error status codes - 500s for example
First off, where in a monolith you were calling a function, now you're making an http request which means there are more things that can go wrong
If one thing fails...
You'll get an alert from the service using it...
But if you're naive in the way you set up alerts, you'll also get an alert from anything calling THAT service
Getting alerts from multiple services can also make it difficult to find the cause
And when things DO go wrong...
This is what it feels like
…
You need to be able to support your system, which means you need to sort out your monitoring and alerting.
...
It was clear this was causing us problems, especially when we looked at the numbers: with the system and functional monitoring alerts added together, that's one every 5 minutes
so with the support of our Product Owner, we took some time to work on this.
We have a thing we do at the FT called a Quickstart - we take a small team, maybe from several different projects or skillsets, and we put them in a room together
No specific requirements, no backlog - just a topic of interest.
From feedback, it's apparently very important that free coffee and biscuits get delivered twice a day…
In this case - we focussed on alerts and how to make them more useful and rescue our email inboxes
(There's more details on this on our Technology blog, the Engine room)
As a result of this I can tell you about three principles that helped us to reduce the number of alerts and spend less time responding to false alarms and confusing information
We got some things right, and I'll cover those later
What we got wrong is that we created far too many alerts without thinking about why we were doing it… it was just another thing on the checklist - create an alert.
The problem is, you probably don't care about these alerts.
I mean, how much do you care about NTP issues in non-production environments?
But more importantly, you don't care about response times or errors where a service is just passing on what it got from lower down the stack
27. It's the business functionality you care about
Not the individual microservice.
For example, we are responsible for publishing FastFT posts - if that widget on the right on our site home page stops getting the latest updates, we will hear about it
So that's what our alerts should be focussing on
So to tell you what's important to us, I need to tell you a bit about our system...
This is a logical view of the Universal Publishing Platform
multiple source content management systems, sending us articles, blogs, images, videos etc
when content is published, it's transformed into a common format
and annotated using a concept extraction pipeline
we also have metadata taxonomies like organisations, people, memberships, all loaded in
then there are APIs to get content and metadata about content
articles about Apple -> Information about Apple -> Information about Tim Cook -> Other companies he's involved with, etc. etc. etc
Architecturally, we have a mix of Go and Java/Dropwizard apps. We use Kafka to send messages about events. We have GraphDB and Mongo data stores.
So what is our key business functionality?
1. Publishing and transforming content
2. Annotating that content - i.e. working out which companies an article mentions, or what person it's about
3. Loading updates of our data about organisations, people, etc
4. Making all that information available via APIs
But it's not the same things we care about for each...
We want to know about every failure, because each failure is a story that our customers can't read yet
Our alert should make it clear we've failed to publish something, AND what needs to be done to fix it
For publication, there aren't that many events a day - maybe 600. We can look at individual events.
For our APIs, we have 2.8 million requests a day at the moment, a little over 30 a second.
So we look at 95th and 99th percentile response time, for example, to make sure they're ok.
It doesn't have to be super fast, but it definitely can't be super slow
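As a rough sketch of how that can be wired up with the Codahale metrics that come with Dropwizard (the metric name and the handler here are made up for illustration, not our actual code):

```java
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Snapshot;
import com.codahale.metrics.Timer;

public class ReadApiTimings {
    private final MetricRegistry registry = new MetricRegistry();
    private final Timer requestTimer = registry.timer("read-api.requests");

    // Wrap each request in the timer so the histogram accumulates durations
    public String handleRequest() {
        try (Timer.Context ignored = requestTimer.time()) {
            return fetchContent(); // hypothetical handler body
        }
    }

    // Percentiles come from the timer's snapshot (values are in nanoseconds)
    public void logPercentiles() {
        Snapshot snapshot = requestTimer.getSnapshot();
        System.out.printf("p95=%.0fms p99=%.0fms%n",
                snapshot.get95thPercentile() / 1_000_000,
                snapshot.get99thPercentile() / 1_000_000);
    }

    private String fetchContent() { return "..."; }
}
```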
But we don't JUST care about speed...
i.e. did something go wrong.
The obvious thing to look for is server errors - something has gone wrong somewhere in our stack.
The graph here shows when some of our blades failed in a data centre. This is for some business functionality that's not critical at the moment, so we are comfortable with all the nodes being in the same data centre, in case you're wondering why a blade failure would break things!
The sudden increase in 500 errors triggered our alerts so we knew about this really quickly.
However, we also look for client errors
a sudden increase in 400 errors, i.e. bad requests, could be your fault. We've made changes that turned out to break our API contract - e.g. POST requests suddenly needed to have Content-Type header application/json. Meets http spec, but is less lenient, and so BAD. We would want an alert for that.
We have built in back off and retry for recoverable errors
Sometimes the first request fails, and the second one succeeds. We don't want an alert in that case.
We might want a report, so we know we have a flaky connection. Or we might just accept that our network is evil.
Otherwise, it's just noise
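A minimal sketch of the back off and retry idea, assuming a generic Callable; the attempt counts and delays are illustrative, and real code would only retry errors it knows are recoverable:

```java
import java.time.Duration;
import java.util.concurrent.Callable;

public final class Retry {
    // Retry a recoverable call a few times with exponential backoff;
    // only let the exception escape (and so become an alert) once we give up.
    public static <T> T withBackoff(Callable<T> call, int maxAttempts) throws Exception {
        Duration wait = Duration.ofMillis(200);
        for (int attempt = 1; ; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                if (attempt == maxAttempts) {
                    throw e; // exhausted retries: this failure is worth alerting on
                }
                // transient failure: worth counting in a report, not waking anyone up
                Thread.sleep(wait.toMillis());
                wait = wait.multipliedBy(2);
            }
        }
    }
}
```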
Your alerts should be something you don't mind being interrupted about
You can go look at it whenever you want.
I bet you won't look at it as often as you think you will
We got rid of:
all our publish microservice-specific response time alerts
all our microservice-specific error alerts
and made the most interesting ones into real-time dashboards
Now your alerts really mean you need to react, make them unmissable.
This means they need to attract the attention of the people that need to react. How you do that depends on your team and your working practices
We have an 'Ops Cop', and take it in turns to do that role for a week. The ops cop will also take on small pieces of work, tidying up, refactoring - things that don't need you to be in flow (because you WILL get interrupted)
Anyone reading the alert should be able to work out:
what it actually means
the action they need to take
who to talk to if they get stuck
Use clear language and don't be vague.
Add a link to explanatory information (panic guide) - this needs to be clear too, and needs to be reviewed by someone who may have to use it but didn't write the service (e.g. new team members who've never had to look at this service/operations)
Consider how to make "future you's" life easier:
here's a search link to show you the whole transaction
here's a jenkins job to republish
Our transaction IDs are added to logs using MDC (Mapped Diagnostic Context)
Every microservice we write needs to check for a special X-Request-Id header (we do this via a Servlet Filter) and then add it to the thread context. Any requests it makes over http must pass on the X-Request-Id header too.
This means all logs for a particular user request will have a unique identifier logged and we can look at everything that happened when an article was published or a read request was made
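A minimal sketch of the idea (not our actual filter); the MDC key and the tid_ prefix are illustrative, and outbound HTTP clients need a matching interceptor that copies the header onto any requests they make:

```java
import java.io.IOException;
import java.util.UUID;
import javax.servlet.*;
import javax.servlet.http.HttpServletRequest;
import org.slf4j.MDC;

// Reads the incoming X-Request-Id header (or creates one) and puts it into the
// MDC so every log line for this request carries the transaction id.
public class TransactionIdFilter implements Filter {
    private static final String HEADER = "X-Request-Id";

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        String transactionId = ((HttpServletRequest) request).getHeader(HEADER);
        if (transactionId == null || transactionId.isEmpty()) {
            transactionId = "tid_" + UUID.randomUUID();
        }
        MDC.put("transaction_id", transactionId);
        try {
            chain.doFilter(request, response);
        } finally {
            MDC.remove("transaction_id"); // don't leak ids onto reused threads
        }
    }

    @Override public void init(FilterConfig filterConfig) {}
    @Override public void destroy() {}
}
```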
We have an FT standard for healthchecks - you must return a particular json response on a particular endpoint.
You return 200 for unhealthy as well - there was some debate about this; the logic is that a 500 indicates that the healthcheck can't be run, which is different from it failing
You have to look at each check to work out whether you have any failing checks
This is what the json looks like
There's a chrome plugin to make it look nicer for humans
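Purely as an illustration of the shape (the field names here are made up, not the actual FT standard), a Dropwizard/JAX-RS resource serving a healthcheck like this might look something like:

```java
import java.util.List;
import java.util.Map;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

// Illustrative only: the real FT standard defines the exact fields. The key ideas are
// that the endpoint always returns 200 if the checks could be *run*, and each check
// reports its own "ok" flag plus enough context to act on a failure.
@Path("/__health")
public class HealthcheckResource {

    @GET
    @Produces(MediaType.APPLICATION_JSON)
    public Map<String, Object> health() {
        Map<String, Object> mongoCheck = Map.of(
                "name", "can reach MongoDB",
                "ok", canReachMongo(),
                "businessImpact", "Content cannot be read via the API",
                "panicGuide", "https://example.com/runbooks/content-api"); // placeholder URL
        return Map.of(
                "name", "content-api",            // hypothetical service name
                "checks", List.of(mongoCheck));
    }

    private boolean canReachMongo() { return true; } // stub for the real check
}
```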
You want to know about problems before they affect your customers, if possible.
We started off with synthetic publication requests.
Synthetic publication takes a known, old article, and publishes it every minute.
If this breaks, we can fix it before a single real publish fails.
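A sketch of what a synthetic publisher could look like; the endpoint, article id and payload are placeholders, and the real thing hooks failures into alerting rather than just logging:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Republishes a known, old article once a minute and flags any failure.
public class SyntheticPublisher {
    private static final String KNOWN_ARTICLE_ID = "known-article-uuid"; // placeholder
    private final HttpClient client = HttpClient.newHttpClient();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public void start() {
        scheduler.scheduleAtFixedRate(this::publishOnce, 0, 1, TimeUnit.MINUTES);
    }

    private void publishOnce() {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://publish.example.com/content/" + KNOWN_ARTICLE_ID))
                .header("X-Request-Id", "tid_synthetic_" + System.currentTimeMillis())
                .PUT(HttpRequest.BodyPublishers.ofString("{\"synthetic\": true}"))
                .build();
        try {
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() >= 300) {
                System.err.println("SYNTHETIC PUBLISH FAILED: status " + response.statusCode());
            }
        } catch (Exception e) {
            System.err.println("SYNTHETIC PUBLISH FAILED: " + e.getMessage());
        }
    }
}
```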
By basic, I mean standard
A puppet based framework
goal: for developers to reliably build & deploy services* from "zero-to-customer" in less than 15mins.
... across data centres, with monitoring...
supports multiple IaaS providers
digression:
some debate about FT Platform internally, some teams aren't using it: heroku or 'naked' AWS
personal opinion: FT Platform bootstrapped this type of deployment at the FT, and at the time most developers weren't that familiar with the underlying tools; but if you are already familiar with Heroku and AWS, it can feel like you're being restricted
we're now evolving FT platform to reflect that, with a move to CloudFormation and an internal tool called Konstructor that provides an API wrapper round a lot of our other tools
however:
gave us monitoring and log aggregation for any new microservice with no additional effort
nagios monitors system metrics, network protocols, applications, services, servers, and network infrastructure
alerts via email or (god forbid) SMS when there are failures and when the service recovers
you can acknowledge alerts to stop the notifications
put into maintenance mode for known downtimes
Every VM set up using FT Platform automatically forwards logs to Splunk.
Any queries you want to do across all hosts in a service, or across all services that take part in a particular event, are easy to run without having to jump onto the relevant box
We use it to
identify problems and alert
visualise performance or load
create dashboards for particular services
But more recently, we're moving away from Splunk dashboards..
And instead we're graphing our metrics using Graphite and Grafana.
We're using Dropwizard for our Java apps and that comes with codahale metrics embedded. It's a small config change to write those metrics to a graphite server...
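Something like this (host, port and prefix are placeholders; Dropwizard can also do the same thing purely through its YAML config):

```java
import java.net.InetSocketAddress;
import java.util.concurrent.TimeUnit;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.graphite.Graphite;
import com.codahale.metrics.graphite.GraphiteReporter;

public class GraphiteWiring {
    // Ship the app's Codahale metrics to a Graphite server once a minute.
    public static void report(MetricRegistry registry) {
        Graphite graphite = new Graphite(new InetSocketAddress("graphite.example.com", 2003));
        GraphiteReporter reporter = GraphiteReporter.forRegistry(registry)
                .prefixedWith("content.read-api.host-1")
                .convertRatesTo(TimeUnit.SECONDS)
                .convertDurationsTo(TimeUnit.MILLISECONDS)
                .build(graphite);
        reporter.start(1, TimeUnit.MINUTES);
    }
}
```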
Graphite isn't particularly pretty - you can see all the metrics and compose graphs on the fly...
... but by using Grafana on top of it you can easily create beautiful custom dashboards...
They're quick to load as well.
This shows one of our Read API components, so we're interested in server errors, client errors, successful requests
And also request rate across hosts.
Interesting - here, the traffic started to switch over from one data centre to another, I have no idea why!
We were using Splunk to pick up ERROR level logs
The problem there is that every ERROR results in an alert.
You might be more hardcore about this than me, but unless you have zero tolerance of ERROR logs, there will be times when there are some errors that aren't a priority - they don't represent a major issue and there aren't that many
We got some of those from the client we use to talk to Kafka
We were ignoring them and missed a problem someone introduced that also caused ERROR logs.
That wouldn't happen in Sentry or equivalent tools, because each new error TYPE results in an alert.
Again, sending information for a Dropwizard app to Sentry is a simple configuration to send logs out to the sentry API
OK, so that's the basic tools...
If the basic tools aren't giving you what you need, build your own.
This is easier if those basic tools have good APIs - because you can create your own view easily
Our first 'extra' tool was created by one of our integration engineers - he turned up with it one day…
SAWS
Built using Blinky tape - a programmable LED strip
Each section represents a different part of our system
Things light up when there's a problem, and when there isn't a problem, the blue lights swoosh back and forth so you know the monitoring is still running.
It used to be really cool and run on a Raspberry Pi - it's a Python script - but that broke and now it runs on an old Windows box under someone's desk.
So why did Silvano create this?
First off, frustrations with the number of emails...
Which he was sending straight to the bin...
And secondly, frustration with monitoring screens
He wanted something that was easy to instantly see if there was a problem
This is SAWS up in our office.
It's pretty simple - red indicates something bad has happened.
and he also changed from green to blue after this to make sure everyone can see if there's a problem…
It's not really this bright :)
So that was our first tool. Our second tool addresses the problem of waiting for screens to cycle through to see the one you want to see - by providing a single screen that can tell you what you need to know...
Dashing is a Sinatra based framework that lets you build beautiful dashboards.
Originally built by Shopify for showing things on monitors around the office
Adopted by the FT - lots of things we care about are very easy to add as tiles:
nagios (monitoring)
jenkins (build and deployment)
pingdom (website monitoring)
And it's not hard to add a new widget to integrate another system.
This is the customised dashboard for our system. We have tiles for our nagios monitors, and for particular jenkins jobs - the ones with the dial
Nagios chart gives us the last 24 hours history for each Nagios monitor.
Means if we have intermittent errors that happen a lot, we don't miss them. And if something big happens when we're not there, we still know about it
So how does it work?
It screenscrapes Nagios for status - this is what that information looks like on nagios.
Nagios chart pings this regularly and keeps the information in memory for 24 hours (we go back that far as it lets us see what happened overnight, plus that was the limit before having to store it somewhere other than memory)
Each line is a service - in this case, it's all the services in Production on AWS for one of our teams
The name of the service, and of each check that failed, are shown on the left.
The bars on the right show the status at any point.
All failures are 'soft' failures - e.g. we don't wait for 3 failures to happen before indicating there was a problem. This allows us to see intermittent issues (but probably results in some noise)
YELLOW: WARNING status - a minor failure - e.g. a check took slightly over the max time to respond
RED: CRITICAL status - a major failure, i.e. no response for a check
BLUE: ACKED state
So here you can see a large data load happening that put strain onto all our servers - they were in a flapping state for hours. At some point, people started acknowledging the alerts
This one is worse. We had major problems in our Test environment - our graph database fell over. everything that had anything to do with graphs pretty much went down.
As it's Test, there was less acknowledging going on
Here are two major problems, one after the other - the pink vertical lines show when nagios chart couldn't connect to nagios; this was down to packet loss on our network.
The red bars were a firewall upgrade, eventually rolled back. Again, this is Test.
Nagios chart works because it uses the human ability to make sense of patterns - we generally know when things are going wrong just out of the corner of our eye
If viewed on your browser, pixel mapping takes you RIGHT to the error in nagios
It's been successful - individual teams picked it up and it's been adopted by our platform and environments team, to make it available more generally at the FT.
If it sounds interesting, let me know - it's not open sourced yet.
…
So the final comment on tools is about the tools you use for communication...
That's probably a bit harsh..
But it's certainly not email for me.
Even if you get the numbers down to a manageable level, threaded view isn't good for alerts - and it's hard to work out what they mean from this view (I realised after I took this screenshot that these aren't even alerts for my system - another team copied config and sent us all their alerts for a while)
And we are moving away from email for team communication at the FT…
We're using Slack a lot - most people have a Slack client open.
Slack has great integration tools
webhooks let you call an http endpoint and post a message
email integration fits well with existing tools - anything that can send an email can send a Slack message
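For example, a tiny sketch of posting an alert to an incoming webhook; the webhook URL is a placeholder, and Slack expects a small JSON payload with a "text" field:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Posts an alert message to a Slack incoming webhook.
public class SlackNotifier {
    private static final String WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"; // placeholder
    private final HttpClient client = HttpClient.newHttpClient();

    public void notify(String message) throws Exception {
        String payload = "{\"text\": \"" + message.replace("\"", "\\\"") + "\"}";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(WEBHOOK_URL))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(payload))
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            System.err.println("Slack webhook returned " + response.statusCode());
        }
    }
}
```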
One of my colleagues tried to persuade me to set up a separate channel for our alerts, not using the main team channel.
I think that's effectively saying "Put it somewhere where I can ignore it"
If you are getting so many of these alerts that it's annoying, there are two things you can do:
tune the alert (e.g. for API requests we alert on an increased number of failures in a ten-minute period, so we tend to get this alert for real issues rather than network blips)
fix your broken system
One other thing I'm trying to persuade people to do is use Slack reactions to show that you've picked up an alert, and fixed it
I read that editorial teams are using Slack like this to move content through a workflow.
We tend to react with a tick when we've fixed something, and with 'eyes' if we're still looking into it
But the problem I have is the creativity of developers - I have to ask people what they mean by a dancing lady...
If you put screens up that are clear in what they are showing - you'll notice when things go wrong
Non-developers on the team will also notice and tell you something's started flashing
Don't loop between screens - put something up that tells you what you need to know. Have more than one screen!
You have to keep a focus on them or they start to get untidy
Did you do something as a result of getting it? If no, delete it
Language should be clear - avoid jargon
Get rid of typos
Link to useful documentation
Get your newest developer to read it
Get someone from another team to read it
This is text for an email alert based on looking at access log response times
First of all - what a very developer-ish title for the alert: no spaces, and categorised by how often it runs rather than what it means
Next up - it MIGHT result in articles not getting published.
I want to know if it DID result in articles not getting published. Also - the business doesn't care about my Methode API microservice (which is a microservice wrapping calls over CORBA so that most people don't have to deal with CORBA)
But our alerts also have a Technical impact section...
I have no idea why we decided these and only these were the reasons for slow response times. It doesn't help me work out which of these is currently the issue.
First of all - spaces in the title!
This is better - at least I can tell that it's a publish failure, from our Methode CMS.
And I can see which articles failed.
And I can go and look at a run book for more information - in fact, the run book links to somewhere (actually a Jenkins job) where you can enter the list of UUIDs and kick off a republish process. (yes, could be automated, but sometimes you want to check it's not going to fail the second time, e.g. editors use their systems in ways we didn't predict)
All of which make it much less annoying to have to deal with an alert.
This alert goes to some people in our editorial department, so they can check status and republish
So whenever you get an alert, really look at it
If someone had to come and tell you your system is broken, you probably need to find a way to know first the next time
Although… for some things, a slack channel that people know about is pretty good
Maybe you need to create a synthetic request, or add the right logs and create a Splunk alert
Here - something that picked up when the percentage of failures increased told us we had a problem
We've had a case where the integration that tells us when an article is published broke
Our monitoring starts from that notification
We found out via manual testing 3 days later
We asked the CMS team to add their own monitoring - but we also added a brute force test ourselves - "did we see any blog publishes in the last day?"
I managed to turn off our publication failure alerts because I "improved" some logging
We worked this out when part of our data centre went down but we didn't see these alerts firing
If your log entry is the basis for an alert, add a unit test that will fail if it's changed and explicitly says what the impact is
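A sketch of what such a test can look like with JUnit and a logback ListAppender; the logger class and the 'PUBLISH FAILED' phrase are made up for illustration:

```java
import static org.junit.Assert.assertTrue;

import ch.qos.logback.classic.Logger;
import ch.qos.logback.classic.spi.ILoggingEvent;
import ch.qos.logback.core.read.ListAppender;
import org.junit.Test;
import org.slf4j.LoggerFactory;

// Hypothetical production class whose log line a Splunk alert depends on
class PublishFailureLogger {
    private static final org.slf4j.Logger LOG = LoggerFactory.getLogger("publish");
    void logFailure(String uuid) {
        LOG.error("PUBLISH FAILED for uuid={}", uuid);
    }
}

public class PublishFailureLogFormatTest {

    // A Splunk alert searches for this exact phrase. If someone "improves" the wording,
    // this test fails and says what the impact is: the alert silently stops firing.
    @Test
    public void publishFailureLogLineStillMatchesTheSplunkAlert() {
        Logger logger = (Logger) LoggerFactory.getLogger("publish");
        ListAppender<ILoggingEvent> appender = new ListAppender<>();
        appender.start();
        logger.addAppender(appender);

        new PublishFailureLogger().logFailure("some-uuid");

        assertTrue("Changing this log line breaks the 'Publish Failures' Splunk alert",
                appender.list.stream()
                        .anyMatch(e -> e.getFormattedMessage().contains("PUBLISH FAILED")));
    }
}
```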
Maybe you take down one of your systems and check that you can tell the impact from the alerts you get
Maybe do this blind and see how quickly the Ops Cop can work out what's broken (we haven't done this but I'd like to)
Or you can take part in company exercises - the FT took down one of our data centres earlier this year.
We'd built up to it with smaller tests, did it on an agreed date, and made sure the right people were available. Crucially, every issue we found was worked through.
We turned off a different data centre last weekend.
For us as developers, a few weeks before we started thinking about what might happen.
We KNEW we didn't have resilience for one part of our system as part of a phased approach to delivery.
However, when we started to think about what was going to happen, we found several unexpected reasons why we weren't going to have a working system (bad configuration, mostly) - we had those fixed before the day.
Netflix have their Chaos Monkey for testing resilience by randomly killing instances and services (in fact they have an entire Simian Army to test resilience at different levels)
The FT has its own Chaos Snail. If you're wondering why it's called that, it's smaller-scale than the chaos monkey, and it's written in shell
This runs on a virtual machine, kills processes as root, and records its work. It's a good way to see if your alerts are working.
It needs to be at least as available as the system it's monitoring
This is something that took us a while to really get to grips with. But if the monitoring system is down, you have no idea what the state of your system is.
…
So that's it from me in terms of advice, so I guess the question is...
Zero emails from Nagios - we have our inbox back!
We rely on our other tools
We can't miss them. They are genuine alerts
So we can see how we're doing on response times and error rates at any point
There are lots of good reasons to do that
But realise what it means to support them
Think about it from the start
Make sure you have the right tools
Continue to cultivate your alerts
Our company page on Stack overflow describes our technologies and the culture of our Technology department
We also have a technology blog where we talk about some of the things we're trying out
We have lots of our code on github and are doing this more and more