You've heard all about what microservices can do for you. You're convinced. So you build some. Reasoning about your functionality is way easier: these services are so simple! Then you get to the point where you have 35 microservices, in three data centres, and all the monitoring and alerting tactics you used for your monoliths are a complete disaster. You can't pick out the important stuff and your inbox is unusable. Something needs to change, and this talk will explain what and how.
Sarah Wells - Alert overload: How to adopt a microservices architecture without being overwhelmed with noise
1. MILAN 20/21.11.2015
Alert overload: How to adopt a microservices architecture without being overwhelmed with noise
Sarah Wells - Financial Times
@sarahjwells
46. Healthchecks tell you whether a service is OK
GET http://{service}/__health
returns 200 if the service can run the healthcheck
each check will return "ok": true or "ok": false
50. Synthetic requests tell you about problems early
https://www.flickr.com/photos/jted/5448635109
82. See if you can improve it
www.workcompass.com/
83. Splunk Alert: PROD - MethodeAPIResponseTime5MAlert
Business Impact
The methode api server is slow responding to requests. This might result in articles not getting published to the new content platform or publishing requests timing out.
...
85. …
Technical Impact
The server is experiencing service degradation because of network latency, high publishing load, high bandwidth utilization, excessive memory or cpu usage on the VM. This might result in failure to publish articles to the new content platform.
86. Splunk Alert: PROD Content Platform Ingester Methode Publish Failures Alert
There has been one or more publish failures to the Universal Publishing Platform. The UUIDs are listed below. Please see the run book for more information.
_time transaction_id uuid
Mon Oct 12 07:43:54 2015 tid_pbueyqnsqe a56a2698-6e90-11e5-8608-a0853fb4e1fe
Two years ago, I started working on a new project at the FT, rebuilding our content platform and APIs. We're using a microservice architecture.
I'm here to talk about what it's like to move from monitoring a monolithic application to monitoring a whole lot of microservices.
Which is also about what it's like to start doing devops, because when you are building new microservices whenever you need, and throwing them away when they stop being useful, you can't do a handover to a separate operations team each time: it takes too long.
So you are going to be supporting your services, and the pain that used to be felt by operations when you didn't get monitoring and alerting right is now being felt by you…
I'm guessing a lot of people in this room have been on a support mailing list at some point, so this probably looks familiar.
Too many emails, and very hard to work out what they really mean.
The bad news is ...
I saw this recently and it made me laugh.
BUT - there are lots of things I really like about microservices!
It's easy to reason about the logic within a microservice
it's easier to deploy small changes both quickly and reversibly,
it's easy to change your architecture, and once you have,
it's easy to remove the code you don't need any more, because it's all in one service and you can check that nothing is calling it via the access logs for the service…
So I don't want to go back to writing monolithic applications - but I do think that monitoring is harder for a microservice architecture.
So why is that?
Firstly, instead of 1 service, we have 45
We currently have Integration, Test and Production environments.
There's some debate about whether we need three and other teams at the FT only have production
We have at least 2 instances, for resilience, and sometimes more.
And at the moment, each of those is on its own VM
These are system checks - disk space, CPU load, NTP, DNS
Most of the checks run more often than every 5 minutes in fact
Which means you get alerts for unlikely and transient issues all the time.
Earlier this year, a new developer joined our team, and he couldn't believe the number of alert emails we were getting. He started counting.
And that's on average.
When shared infrastructure goes wrong - for example if system time isn't being properly synchronised, or someone accidentally switches off a DNS server - and you're monitoring it from every server, EVERYTHING lights up
As an example, we use puppet to automate server setup and deployment - and we had 20000 alert emails overnight for a PLANNED failover of our puppet master
But it's not just system monitoring that is painful...
We started out creating alerts and monitoring a lot like we did for monolithic applications:
alerts based on response time
alerts for ERROR logs or responses that are server error status codes - 500s for example
First off, where in a monolith you were calling a function, now you're making an http request which means there are more things that can go wrong
If one thing fails...
You'll get an alert from the service using it...
But if you're naive in the way you set up alerts, you'll also get an alert from anything calling THAT service
Getting alerts from multiple services can also make it difficult to find the cause
And when things DO go wrong...
This is what it feels like
…
You need to be able to support your system, which means you need to sort out your monitoring and alerting.
...
It was clear this was causing us problems, especially when we looked at the numbers: with the system and functional monitoring alerts added together, that's one every 5 minutes
so with the support of our Product Owner, we took some time to work on this.
We have a thing we do at the FT called a Quickstart - we take a small team, maybe from several different projects or skillsets, and we put them in a room together
No specific requirements, no backlog - just a topic of interest.
From feedback, it's apparently very important that free coffee and biscuits get delivered twice a day…
In this case - we focussed on alerts and how to make them more useful and rescue our email inboxes
(There's more details on this on our Technology blog, the Engine room)
As a result of this I can tell you about three principles that helped us to reduce the number of alerts and spend less time responding to false alarms and confusing information
We got some things right, and I'll cover those later
What we got wrong is that we created far too many alerts without thinking about why we were doing it… it was just another thing on the checklist - create an alert.
The problem is, you probably don't care about these alerts.
I mean, how much do you care about NTP issues in non-production environments?
But more importantly, you don't care about response times or errors where a service is just passing on what it got from lower down the stack
27. It's the business functionality you care about
Not the individual microservice.
For example, we are responsible for publishing FastFT posts - if that widget on the right on our site home page stops getting the latest updates, we will hear about it
So that's what our alerts should be focussing on
So to tell you what's important to us, I need to tell you a bit about our system...
This is a logical view of the Universal Publishing Platform
multiple source content management systems, sending us articles, blogs, images, videos etc
when content is published, it's transformed into a common format
and annotated using a concept extraction pipeline
we also have metadata taxonomies like organisations, people, memberships, all loaded in
then there are APIs to get content and metadata about content
articles about Apple -> Information about Apple -> Information about Tim Cook -> Other companies he's involved with, etc. etc. etc
Architecturally, we have a mix of Go and Java/Dropwizard apps. We use Kafka to send messages about events. We have GraphDB and Mongo data stores.
So what is our key business functionality?
1. Publishing and transforming content
2. Annotating that content - i.e. working out which companies an article mentions, or what person it's about
3. Loading updates of our data about organisations, people, etc
4. Making all that information available via APIs
But it's not the same things we care about for each...
We want to know about every failure, because each failure is a story that our customers can't read yet
Our alert should make it clear we've failed to publish something, AND what needs to be done to fix it
For publication, there aren't that many events a day - maybe 600. We can look at individual events.
For our APIs, we have 2.8 million requests a day at the moment, a little over 30 a second.
So we look at 95th and 99th percentile response time, for example, to make sure they're ok.
It doesn't have to be super fast, but it definitely can't be super slow
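As a rough sketch of how that can be wired up with the Codahale metrics that come with Dropwizard (the metric name and the handler here are made up for illustration, not our actual code):

```java
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Snapshot;
import com.codahale.metrics.Timer;

public class ReadApiTimings {
    private final MetricRegistry registry = new MetricRegistry();
    private final Timer requestTimer = registry.timer("read-api.requests");

    // Wrap each request in the timer so the histogram accumulates durations
    public String handleRequest() {
        try (Timer.Context ignored = requestTimer.time()) {
            return fetchContent(); // hypothetical handler body
        }
    }

    // Percentiles come from the timer's snapshot (values are in nanoseconds)
    public void logPercentiles() {
        Snapshot snapshot = requestTimer.getSnapshot();
        System.out.printf("p95=%.0fms p99=%.0fms%n",
                snapshot.get95thPercentile() / 1_000_000,
                snapshot.get99thPercentile() / 1_000_000);
    }

    private String fetchContent() { return "..."; }
}
```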
But we don't JUST care about speed...
i.e. did something go wrong.
The obvious thing to look for is server errors - something has gone wrong somewhere in our stack.
The graph here shows when some of our blades failed in a data centre. This is for some business functionality that's not critical at the moment, so we are comfortable with all the nodes being in the same data centre, in case you're wondering why a blade failure would break things!
The sudden increase in 500 errors triggered our alerts so we knew about this really quickly.
However, we also look for client errors
a sudden increase in 400 errors, i.e. bad requests, could be your fault. We've made changes that turned out to break our API contract - e.g. POST requests suddenly needed to have Content-Type header application/json. Meets http spec, but is less lenient, and so BAD. We would want an alert for that.
We have built in back off and retry for recoverable errors
Sometimes the first request fails, and the second one succeeds. We don't want an alert in that case.
We might want a report, so we know we have a flaky connection. Or we might just accept that our network is evil.
Otherwise, it's just noise
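A minimal sketch of the back off and retry idea, assuming a generic Callable; the attempt counts and delays are illustrative, and real code would only retry errors it knows are recoverable:

```java
import java.time.Duration;
import java.util.concurrent.Callable;

public final class Retry {
    // Retry a recoverable call a few times with exponential backoff;
    // only let the exception escape (and so become an alert) once we give up.
    public static <T> T withBackoff(Callable<T> call, int maxAttempts) throws Exception {
        Duration wait = Duration.ofMillis(200);
        for (int attempt = 1; ; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                if (attempt == maxAttempts) {
                    throw e; // exhausted retries: this failure is worth alerting on
                }
                // transient failure: worth counting in a report, not waking anyone up
                Thread.sleep(wait.toMillis());
                wait = wait.multipliedBy(2);
            }
        }
    }
}
```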
Your alerts should be something you don't mind being interrupted about
You can go look at it whenever you want.
I bet you won't look at it as often as you think you will
We got rid of:
all our publish microservice-specific response time alerts
all our microservice-specific error alerts
and made the most interesting ones into real-time dashboards
Now your alerts really mean you need to react, make them unmissable.
This means they need to attract the attention of the people that need to react. How you do that depends on your team and your working practices
We have an 'Ops Cop', and take it in turns to do that role for a week. The ops cop will also take on small pieces of work, tidying up, refactoring - things that don't need you to be in flow (because you WILL get interrupted)
Anyone reading the alert should be able to work out:
what it actually means
the action they need to take
who to talk to if they get stuck
Use clear language and don't be vague.
Add a link to explanatory information (panic guide) - this needs to be clear too, and needs to be reviewed by someone who may have to use it but didn't write the service (e.g. new team members who've never had to look at this service/operations)
Consider how to make "future you's" life easier:
here's a search link to show you the whole transaction
here's a jenkins job to republish
Our transaction IDs are added to logs using MDC (Mapped Diagnostic Context)
Every microservice we write needs to check for a special X-Request-Id header (we do this via a Servlet Filter) and then add it to the thread context. Any requests it makes over http must pass on the X-Request-Id header too.
This means all logs for a particular user request will have a unique identifier logged and we can look at everything that happened when an article was published or a read request was made
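A minimal sketch of the idea (not our actual filter); the MDC key and the tid_ prefix are illustrative, and outbound HTTP clients need a matching interceptor that copies the header onto any requests they make:

```java
import java.io.IOException;
import java.util.UUID;
import javax.servlet.*;
import javax.servlet.http.HttpServletRequest;
import org.slf4j.MDC;

// Reads the incoming X-Request-Id header (or creates one) and puts it into the
// MDC so every log line for this request carries the transaction id.
public class TransactionIdFilter implements Filter {
    private static final String HEADER = "X-Request-Id";

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        String transactionId = ((HttpServletRequest) request).getHeader(HEADER);
        if (transactionId == null || transactionId.isEmpty()) {
            transactionId = "tid_" + UUID.randomUUID();
        }
        MDC.put("transaction_id", transactionId);
        try {
            chain.doFilter(request, response);
        } finally {
            MDC.remove("transaction_id"); // don't leak ids onto reused threads
        }
    }

    @Override public void init(FilterConfig filterConfig) {}
    @Override public void destroy() {}
}
```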
We have an FT standard for healthchecks - you must return a particular json response on a particular endpoint.
You return 200 for unhealthy as well - there was some debate about this; the logic is that a 500 indicates that the healthcheck can't be run, which is different from it failing
You have to look at each check to work out whether you have any failing checks
This is what the json looks like
There's a chrome plugin to make it look nicer for humans
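Purely as an illustration of the shape (the field names here are made up, not the actual FT standard), a Dropwizard/JAX-RS resource serving a healthcheck like this might look something like:

```java
import java.util.List;
import java.util.Map;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

// Illustrative only: the real FT standard defines the exact fields. The key ideas are
// that the endpoint always returns 200 if the checks could be *run*, and each check
// reports its own "ok" flag plus enough context to act on a failure.
@Path("/__health")
public class HealthcheckResource {

    @GET
    @Produces(MediaType.APPLICATION_JSON)
    public Map<String, Object> health() {
        Map<String, Object> mongoCheck = Map.of(
                "name", "can reach MongoDB",
                "ok", canReachMongo(),
                "businessImpact", "Content cannot be read via the API",
                "panicGuide", "https://example.com/runbooks/content-api"); // placeholder URL
        return Map.of(
                "name", "content-api",            // hypothetical service name
                "checks", List.of(mongoCheck));
    }

    private boolean canReachMongo() { return true; } // stub for the real check
}
```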
You want to know about problems before they affect your customers, if possible.
We started off with synthetic publication requests.
Synthetic publication takes a known, old article, and publishes it every minute.
If this breaks, we can fix it before a single real publish fails.
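A sketch of what a synthetic publisher could look like; the endpoint, article id and payload are placeholders, and the real thing hooks failures into alerting rather than just logging:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Republishes a known, old article once a minute and flags any failure.
public class SyntheticPublisher {
    private static final String KNOWN_ARTICLE_ID = "known-article-uuid"; // placeholder
    private final HttpClient client = HttpClient.newHttpClient();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public void start() {
        scheduler.scheduleAtFixedRate(this::publishOnce, 0, 1, TimeUnit.MINUTES);
    }

    private void publishOnce() {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://publish.example.com/content/" + KNOWN_ARTICLE_ID))
                .header("X-Request-Id", "tid_synthetic_" + System.currentTimeMillis())
                .PUT(HttpRequest.BodyPublishers.ofString("{\"synthetic\": true}"))
                .build();
        try {
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() >= 300) {
                System.err.println("SYNTHETIC PUBLISH FAILED: status " + response.statusCode());
            }
        } catch (Exception e) {
            System.err.println("SYNTHETIC PUBLISH FAILED: " + e.getMessage());
        }
    }
}
```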
By basic, I mean standard
A puppet based framework
goal: for developers to reliably build & deploy services* from "zero-to-customer" in less than 15mins.
... across data centres, with monitoring...
supports multiple IaaS providers
digression:
some debate about FT Platform internally, some teams aren't using it: heroku or 'naked' AWS
personal opinion: FT Platform bootstrapped this type of deployment at the FT, and at the time most developers weren't that familiar with the underlying tools; but if you are already familiar with Heroku and AWS, it can feel like you're being restricted
we're now evolving FT platform to reflect that, with a move to CloudFormation and an internal tool called Konstructor that provides an API wrapper round a lot of our other tools
however:
gave us monitoring and log aggregation for any new microservice with no additional effort
nagios monitors system metrics, network protocols, applications, services, servers, and network infrastructure
alerts via email or (god forbid) SMS when there are failures and when the service recovers
you can acknowledge alerts to stop the notifications
put into maintenance mode for known downtimes
Every VM set up using FT Platform automatically forwards logs to Splunk.
Any queries you want to do across all hosts in a service, or across all services that take part in a particular event, are easy to run without having to jump onto the relevant box
We use it to
identify problems and alert
visualise performance or load
create dashboards for particular services
But more recently, we're moving away from Splunk dashboards..
And instead we're graphing our metrics using Graphite and Grafana.
We're using Dropwizard for our Java apps and that comes with codahale metrics embedded. It's a small config change to write those metrics to a graphite server...
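Something like this (host, port and prefix are placeholders; Dropwizard can also do the same thing purely through its YAML config):

```java
import java.net.InetSocketAddress;
import java.util.concurrent.TimeUnit;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.graphite.Graphite;
import com.codahale.metrics.graphite.GraphiteReporter;

public class GraphiteWiring {
    // Ship the app's Codahale metrics to a Graphite server once a minute.
    public static void report(MetricRegistry registry) {
        Graphite graphite = new Graphite(new InetSocketAddress("graphite.example.com", 2003));
        GraphiteReporter reporter = GraphiteReporter.forRegistry(registry)
                .prefixedWith("content.read-api.host-1")
                .convertRatesTo(TimeUnit.SECONDS)
                .convertDurationsTo(TimeUnit.MILLISECONDS)
                .build(graphite);
        reporter.start(1, TimeUnit.MINUTES);
    }
}
```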
Graphite isn't particularly pretty - you can see all the metrics and compose graphs on the fly...
... but by using Grafana on top of it you can easily create beautiful custom dashboards...
They're quick to load as well.
This shows one of our Read API components, so we're interested in server errors, client errors, successful requests
And also request rate across hosts.
Interesting - here, the traffic started to switch over from one data centre to another, I have no idea why!
We were using Splunk to pick up ERROR level logs
The problem there is that every ERROR results in an alert.
You might be more hardcore about this than me, but unless you have zero tolerance of ERROR logs, there will be times when there are some errors that aren't a priority - they don't represent a major issue and there aren't that many
We got some of those from the client we use to talk to Kafka
We were ignoring them and missed a problem someone introduced that also caused ERROR logs.
That wouldn't happen in Sentry or equivalent tools, because each new error TYPE results in an alert.
Again, sending information for a Dropwizard app to Sentry is a simple configuration to send logs out to the sentry API
OK, so that's the basic tools...
If the basic tools aren't giving you what you need, build your own.
This is easier if those basic tools have good APIs - because you can create your own view easily
Our first 'extra' tool was created by one of our integration engineers - he turned up with it one day…
SAWS
Built using Blinky tape - a programmable LED strip
Each section represents a different part of our system
Things light up when there's a problem, and when there isn't a problem, the blue lights swoosh back and forth so you know the monitoring is still running.
It used to be really cool and run on a Raspberry Pi - it's a Python script - but that broke and now it runs on an old Windows box under someone's desk.
So why did Silvano create this?
First off, frustrations with the number of emails...
Which he was sending straight to the bin...
And secondly, frustration with monitoring screens
He wanted something that was easy to instantly see if there was a problem
This is SAWS up in our office.
It's pretty simple - red indicates something bad has happened.
and he also changed from green to blue after this to make sure everyone can see if there's a problem…
It's not really this bright :)
So that was our first tool. Our second tool addresses the problem of waiting for screens to cycle through to see the one you want to see - by providing a single screen that can tell you what you need to know...
Dashing is a Sinatra based framework that lets you build beautiful dashboards.
Originally built by Shopify for showing things on monitors around the office
Adopted by the FT - lots of things we care about are very easy to add as tiles:
nagios (monitoring)
jenkins (build and deployment)
pingdom (website monitoring)
And it's not hard to add a new widget to integrate another system.
This is the customised dashboard for our system. We have tiles for our nagios monitors, and for particular jenkins jobs - the ones with the dial
Nagios chart gives us the last 24 hours history for each Nagios monitor.
Means if we have intermittent errors that happen a lot, we don't miss them. And if something big happens when we're not there, we still know about it
So how does it work?
It screenscrapes Nagios for status - this is what that information looks like on nagios.
Nagios chart pings this regularly and keeps the information in memory for 24 hours (we go back that far as it lets us see what happened overnight, plus that was the limit before having to store it somewhere other than memory)
Each line is a service - in this case, it's all the services in Production on AWS for one of our teams
The name of the service, and of each check that failed, are shown on the left.
The bars on the right show the status at any point.
All failures are 'soft' failures - e.g. we don't wait for 3 failures to happen before indicating there was a problem. This allows us to see intermittent issues (but probably results in some noise)
YELLOW: WARNING status - a minor failure - e.g. a check took slightly over the max time to respond
RED: CRITICAL status - a major failure, i.e. no response for a check
BLUE: ACKED state
So here you can see a large data load happening that put strain onto all our servers - they were in a flapping state for hours. At some point, people started acknowledging the alerts
This one is worse. We had major problems in our Test environment - our graph database fell over. everything that had anything to do with graphs pretty much went down.
As it's Test, there was less acknowledging going on
Here are two major problems, one after the other - the pink vertical lines show when nagios chart couldn't connect to nagios; this was down to packet loss on our network.
The red bars were a firewall upgrade, eventually rolled back. Again, this is Test.
Nagios chart works because it uses the human ability to make sense of patterns - we generally know when things are going wrong just out of the corner of our eye
If viewed on your browser, pixel mapping takes you RIGHT to the error in nagios
It's been successful - individual teams picked it up and it's been adopted by our platform and environments team, to make it available more generally at the FT.
If it sounds interesting, let me know - it's not open sourced yet.
…
So the final comment on tools is about the tools you use for communication...
That's probably a bit harsh..
But it's certainly not email for me.
Even if you get the numbers down to a manageable level, threaded view isn't good for alerts - and it's hard to work out what they mean from this view (I realised after I took this screenshot that these aren't even alerts for my system - another team copied config and sent us all their alerts for a while)
And we are moving away from email for team communication at the FT…
We're using Slack a lot - most people have a Slack client open.
Slack has great integration tools
webhooks let you call an http endpoint and post a message
email integration fits well with existing tools - anything that can send an email can send a Slack message
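For example, a tiny sketch of posting an alert to an incoming webhook; the webhook URL is a placeholder, and Slack expects a small JSON payload with a "text" field:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Posts an alert message to a Slack incoming webhook.
public class SlackNotifier {
    private static final String WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"; // placeholder
    private final HttpClient client = HttpClient.newHttpClient();

    public void notify(String message) throws Exception {
        String payload = "{\"text\": \"" + message.replace("\"", "\\\"") + "\"}";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(WEBHOOK_URL))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(payload))
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            System.err.println("Slack webhook returned " + response.statusCode());
        }
    }
}
```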
One of my colleagues tried to persuade me to set up a separate channel for our alerts, not using the main team channel.
I think that's effectively saying "Put it somewhere where I can ignore it"
If you are getting so many of these alerts that it's annoying, there are two things you can do:
tune the alert (e.g. for API requests we alert on an increased number of failures in a ten-minute period, so we tend to get this alert for real issues rather than network blips)
fix your broken system
One other thing I'm trying to persuade people to do is use Slack reactions to show that you've picked up an alert, and fixed it
I read that editorial teams are using Slack like this to move content through a workflow.
We tend to react with a tick when we've fixed something, and with 'eyes' if we're still looking into it
But the problem I have is the creativity of developers - I have to ask people what they mean by a dancing lady...
If you put screens up that are clear in what they are showing - you'll notice when things go wrong
Non-developers on the team will also notice and tell you something's started flashing
Don't loop between screens - put something up that tells you what you need to know. Have more than one screen!
You have to keep a focus on them or they start to get untidy
Did you do something as a result of getting it? If no, delete it
Language should be clear - avoid jargon
Get rid of typos
Link to useful documentation
Get your newest developer to read it
Get someone from another team to read it
This is text for an email alert based on looking at access log response times
First of all - what a very developer-ish title for the alert: no spaces, and categorised by how often it runs rather than what it means
Next up - it MIGHT result in articles not getting published.
I want to know if it DID result in articles not getting published. Also - the business doesn't care about my Methode API microservice (which is a microservice wrapping calls over CORBA so that most people don't have to deal with CORBA)
But our alerts also have a Technical impact section...
I have no idea why we decided these and only these were the reasons for slow response times. It doesn't help me work out which of these is currently the issue.
First of all - spaces in the title!
This is better - at least I can tell that it's a publish failure, from our Methode CMS.
And I can see which articles failed.
And I can go and look at a run book for more information - in fact, the run book links to somewhere (actually a Jenkins job) where you can enter the list of UUIDs and kick off a republish process. (yes, could be automated, but sometimes you want to check it's not going to fail the second time, e.g. editors use their systems in ways we didn't predict)
All of which make it much less annoying to have to deal with an alert.
This alert goes to some people in our editorial department, so they can check status and republish
So whenever you get an alert, really look at it
If someone had to come and tell you your system is broken, you probably need to find a way to know first the next time
Although… for some things, a slack channel that people know about is pretty good
Maybe you need to create a synthetic request, or add the right logs and create a Splunk alert
Here - something that picked up when the percentage of failures increased told us we had a problem
We've had a case where the integration that tells us when an article is published broke
Our monitoring starts from that notification
We found out via manual testing 3 days later
We asked the CMS team to add their own monitoring - but we also added a brute force test ourselves - "did we see any blog publishes in the last day?"
I managed to turn off our publication failure alerts because I "improved" some logging
We worked this out when part of our data centre went down but we didn't see these alerts firing
If your log entry is the basis for an alert, add a unit test that will fail if it's changed and explicitly says what the impact is
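A sketch of what such a test can look like with JUnit and a logback ListAppender; the logger class and the 'PUBLISH FAILED' phrase are made up for illustration:

```java
import static org.junit.Assert.assertTrue;

import ch.qos.logback.classic.Logger;
import ch.qos.logback.classic.spi.ILoggingEvent;
import ch.qos.logback.core.read.ListAppender;
import org.junit.Test;
import org.slf4j.LoggerFactory;

// Hypothetical production class whose log line a Splunk alert depends on
class PublishFailureLogger {
    private static final org.slf4j.Logger LOG = LoggerFactory.getLogger("publish");
    void logFailure(String uuid) {
        LOG.error("PUBLISH FAILED for uuid={}", uuid);
    }
}

public class PublishFailureLogFormatTest {

    // A Splunk alert searches for this exact phrase. If someone "improves" the wording,
    // this test fails and says what the impact is: the alert silently stops firing.
    @Test
    public void publishFailureLogLineStillMatchesTheSplunkAlert() {
        Logger logger = (Logger) LoggerFactory.getLogger("publish");
        ListAppender<ILoggingEvent> appender = new ListAppender<>();
        appender.start();
        logger.addAppender(appender);

        new PublishFailureLogger().logFailure("some-uuid");

        assertTrue("Changing this log line breaks the 'Publish Failures' Splunk alert",
                appender.list.stream()
                        .anyMatch(e -> e.getFormattedMessage().contains("PUBLISH FAILED")));
    }
}
```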
Maybe you take down one of your systems and check that you can tell the impact from the alerts you get
Maybe do this blind and see how quickly the Ops Cop can work out what's broken (we haven't done this but I'd like to)
Or you can take part in company exercises - the FT took down one of our data centres earlier this year.
We'd built up to it with smaller tests, did it on an agreed date, and made sure the right people were available. Crucially, every issue we found was worked through.
We turned off a different data centre last weekend.
For us as developers, a few weeks before we started thinking about what might happen.
We KNEW we didn't have resilience for one part of our system as part of a phased approach to delivery.
However, when we started to think about what was going to happen, we found several unexpected reasons why we weren't going to have a working system (bad configuration, mostly) - we had those fixed before the day.
Netflix have their Chaos Monkey for testing resilience by randomly killing instances and services (in fact they have an entire Simian Army to test resilience at different levels)
The FT has its own Chaos Snail. If you're wondering why it's called that, it's smaller-scale than the chaos monkey, and it's written in shell
This runs on a virtual machine, kills processes as root, and records its work. It's a good way to see if your alerts are working.
It needs to be at least as available as the system it's monitoring
This is something that took us a while to really get to grips with. But if the monitoring system is down, you have no idea what the state of your system is.
…
So that's it from me in terms of advice, so I guess the question is...
Zero emails from Nagios - we have our inbox back!
We rely on our other tools
We can't miss them. They are genuine alerts
So we can see how we're doing on response times and error rates at any point
There are lots of good reasons to do that
But realise what it means to support them
Think about it from the start
Make sure you have the right tools
Continue to cultivate your alerts
Our company page on Stack overflow describes our technologies and the culture of our Technology department
We also have a technology blog where we talk about some of the things we're trying out
We have lots of our code on github and are doing this more and more