Distributed Systems at Scale: Reducing the Fail

DISTRIBUTED SYSTEMS AT SCALE:
REDUCINGTHE FAIL
Kim Moir, Mozilla, @kmoir
URES, November 13, 2015

Water pipe

I often think of our continuous integration system as analogous to a municipal water system. Some days, a sewage system. We control the infrastructure that we
provide but it is constrained. If someone overloads the system with inputs, we will have problems.

Picture by wili_hybrid - Creative Commons Attribution-NonCommercial 2.0 Generic (CC BY-NC 2.0)

https://www.ﬂickr.com/photos/wili/3958348428/sizes/l

I recently read a book called “Thinking in Systems”. It’s a generalized look at complex systems and how they work. It’s not speciﬁc to computer science.

Picture: https://upload.wikimedia.org/wikipedia/commons/b/bf/Slinky_rainbow.jpg Creative commons 2.0

–Donatella H. Meadows, Thinking in Systems
“A system is a set of things…interconnected in such a
way that they produce their own pattern of behaviour
over time.”
This system may be impacted by outside forces

The response to these forces is characteristic of of the system itself

That response is seldom simple in the real world

These same inputs would result in a diﬀerent behaviour a diﬀerent system

WHAT DO WE OPTIMIZE FOR?
One of the questions the book asks is “What are we optimizing for?

WHAT DO WE OPTIMIZE FOR?
• Budget
• Shipping
• End to end time (Developer productivity)
How much budget do we have available to spend our our continuous integration farm?

Can we ship a release to ﬁx a 0-day security issue in less than 24 hours?

Are developers getting their test results quickly that they remain productive?

WHAT ARETHE
CONSTRAINTS?
Another question the books asks is, what are the constraints the system?

WHAT ARETHE
CONSTRAINTS?
• Budget
• Time
Budget for in-house hardware pools and AWS bill

Time for us to optimize the system

Time for developers to wait for their results

I’m going to talk now about how the pain points in this large distributed system. How it can fail in a spectacular fashion.

1. UNPREDICTABLE INPUT
1

Picture is a graph of monthly branch load. We have daily spikes as Mozillians across the world come online and start pushing code. The troughs are weekends.

Complex system, release engineering does not control all the inputs. i.e. Let’s increase test load by 50% on one platform but not increase the hardware pool by a
corresponding amount.

Is someone abusing the try server? Pushes are not coalesced on try.

Occasionally someone will (re)trigger a large of jobs on a single changeset. People who do this often and with good reason usually do so on weekends when there is less
contention for the infrastructure. If not, the pending counts can get very high, especially for in-house pools where we can’t burst capacity.

Solution: Implementing smarter (semi-automatic) test selection. cmanchester work http://chmanchester.github.io/blog/2015/08/06/deﬁning-semi-automatic-test-
prioritization/ Bug https://bugzilla.mozilla.org/show_bug.cgi?id=1184405

2. NO CANARYTESTING
Every night, we generate new AMIs from our Puppet conﬁgs for the Amazon instance types we use. These images are used to instantiate new instances on Amazon. We
have scripts to recycle the instances with old AMIs after the new ones come available. Which is great, however, we don’t have any canary testing for the new AMIs.

So we can have something happen like this

1) Someone releases a Puppet patch that passes tests

2) However, it the AMIs it generates has a permission issue

3) Which prevents all new AMIs from starting the process that connects the test instances to their server

4) So we have thousands of instances up burning money that aren’t doing anything

5) Pending counts go up but it looks like there plenty of machines but pending counts continue to rise

Failure: All the AWS images are coming up but no builds are running.

Solution: We need to implement canary testing for AMIs (Implement a methodology for rolling out new amis in a tiered, trackable, fashion. https://bugzilla.mozilla.org/
show_bug.cgi?id=1146369) (Add automated testing and tiered roll-outs to golden ami generation)

Picture by ross_stachan https://www.ﬂickr.com/photos/ross_strachan/6176512880/sizes/l Creative commons 2.0

3. NOT ALL AUTO SCALING
3

Several works in progress on this front

Our infrastructure does not autoscale for some platforms. Macs tests. (Can’t run tests in on Macs in virtualized environment on non-Mac hardware due to Apple licensing
restrictions.) So we have racks of mac minis. AWS doesn’t allow the licenses for Windows versions we need to test. Also, we can’t run performance tests on cloud
instances because the results are not consistent.

Cannot easily move test machines between pool dynamically. The importance of a platform shifts over time. We need reallocate machines between testing pools. This
is kind of a manual process.

We decided to focus on Linux 64 perf testing, freeing up Linux 32 machines for use in windows testing. This required code changing, imaging new machines, trying to ﬁx
imaging process which had some bugs.

Solutions: Run Windows builds in AWS (now in production on limited branches). Cross compile Mac builds on Linux. Todo: add bug references.

Still, Mac and Windows tests are a problem because we need to have the in-house capacity for peak load which is expensive, or not buy for peak load and deal with a
number of pending jobs.

Picture

"DB-Laaeks25804366678-7". Licensed under CC BY-SA 3.0 via Commons - https://commons.wikimedia.org/wiki/File:DB-Laaeks25804366678-7.JPG#/media/File:DB-

4. SELF-INFLICTED DENIAL OF
SERVICE
Engineering for the long game: Managing complexity in distributed systems. Astrid Atkinson https://www.youtube.com/watch?v=p0jGmgIrf_M

18:47 “Google itself is the biggest source of denial of service attacks”

At Mozilla, we are no diﬀerent.

We retry jobs when they fail due to infrastructure reasons. Which is okay because perhaps there is a machine which is wonky and needs to be removed from the pool.
And the next time it runs it will run on a machine that is in a clean state.

Human error. What has changed? Permissions, network and DNS. Automatic tests of all commits applied to our production infrastructure. Example: IT redirected a
server name to a new host where we didn’t have ssh keys deployed. We understood the change as redirect to a new cname, not a new host. Jobs spiked because they
retried trying to fetch this zip.

Solution: better monitoring of retries, communication, check for retry spike vs regular jobs and alert

Picture http://www.ﬂickr.com/photos/ervins_strauhmanis/9554405492/sizes/l/

5.TOO MUCHTEST LOAD
We run builds on commit. There is some coaslescing on branches. Since we build and test for many platforms, we can run up to 500 jobs (all platforms excluding talos)
on a single commit.

To many jobs for our infrastructure to handle - high pending counts. How can we intelligently shed load?

Solution: Do we really need to run every test on every commit given that many of them don’t historically reveal problems? We have project called SETA which analyzes
historical test data and we have implemented changes to our scheduler to accommodate this. Basically we can reduce the frequency of speciﬁed test runs on a per
platform, per branch basis. This allows us to shed test load and increase the throughput of the system.

Picture: https://www.ﬂickr.com/photos/encouragement/14759554777/

6. SYSTEM RESOURCE LEAKS
I often think that managing a large scale distributed system is like being a water or sanitation engineer for a city. Ironically, LinkedIn thinks so too and advertises these
jobs to me. Where are the leaks happening? Where they would look for wasted water, we look for wasted computing resources.

We recently had a problem where our windows test pending counts spiked quite drastically. We initially thought this was just due to some more test jobs being added to
the platform in parallel while old corresponding tests were not disabled. In fact, the major root cause of the problem was that the additional tests caused additional
overhead on the server responsible for managing jobs on the tests machines. Basically the time from when a test machine ﬁnishes a task and the server responds was
getting very long, which if you multiply this by hundreds of machines and thousands of jobs, leads to a long backlog.

Solution:

This issue was resolved by adding additional servers to the pool that services these test machines, upgrading the RAM on each of them, and increasing the interval at
which they are restarted.

We run many tests in parallel to reduce the end to end time that a build and associated tests take. Hard to balance start up time with time to run actual tests. Running
more tests in parallel means balanced with startup time.

Solution: Implement more monitoring each time a system resource leak causes an outage

Test chunking. chunking by run time.

Picture by Korona Lacasse - Creative Commons 2.0 Attribution 2.0 Generic https://www.ﬂickr.com/photos/korona4reel/14107877324/sizes/l

Another system resource leak that was plugged by the use of a tool called runner. Runner is a project that manages starting tasks in a deﬁned order. https://github.com/
mozilla/build-runner. Basically it ensures that our the machines are in a sane state to run a job. If they are in a good state, we don’t reboot them as often which increases
the overall throughput of the system.

7.TIMETO RECOVER FROM
FAILURE
Our CI system is not a bank’s transaction system or a large commerce website. But still, it meets most of the characteristics of a highly available system. We have some
issues regarding failover, we still have single points of failure in our system.

A few weeks ago, I watched a talk by Paul Hinze of HashiCorp on the primitives of High Availability.

Primitives of high availability talk 42:20

https://speakerdeck.com/phinze/smoke-and-mirrors-the-primitives-of-high-availability

He states that it is inevitable that the system will fail, but the real measure is how quickly we can recover from it.

In terms of managing failure, we have managed to decouple some components in the way that we manage our code repositories. If, for instance, bad code landed on
mozilla-inbound, which causes all the tests to fail, our code sheriﬀs can close this tree, and leave other repositories open.

However, we still have many things that are a single point of failure. For instance, our hardware in our data centre or associated network. Given that jobs automatically
retry if they fail for infrastructure reasons, this allows us to bring the system up without a lot of intervention.

Solution: Distributed failure - branching model and closing trees

Picture by Mike Green - Creative Commons 2.0 Attribution-NonCommercial 2.0 Generic

https://www.ﬂickr.com/photos/30751204@N06/7328288188/sizes/l

8. MONITORING
nagios which alerts to irc

papertrail

email alerts

dashboard

treeherder

new relic (doesn’t really apply to releng)

Picture is Mystery by ©Stuart Richards, Creative Commons by-nc-sa 2.0

https://www.ﬂickr.com/photos/left-hand/6883500405/

We have a dedicated channel for alerts with the state of our build farm. Nagios alerts send an message to the #buildduty channel. Due the large size of our devices on
our build farm, most of these alerts are threshold alerts. We don’t care if a single machine goes down. It will be automatically rebooted and if this doesn’t work, a bug will
be opened for a person to look at it. However, we do care if 200 of them suddenly stop taking jobs. You can see from this page that we have threshold alerts for the
number of pending jobs. If this is a sustained spike, we need to look at it in further detail.

We also have alerts for things like checking that the golden images that we create each night for or Amazon instances are actually completing. And that the process that
kills old Amazon instances if their capacity is not needed is indeed running. We use papertrail as a log aggregator that allows us to quickly search our logs for issues

We use graphite for analytics that allow us to look at long term trends and spikes. For instance, this graph looks at our overall infrastructure time, green is EC2, blue is in-
house hardware.

Problem: All of this data is sometimes overwhelming. Every time we have an outage around something we haven’t monitored previously, we add another alert or
additional monitoring. I don’t really know the solution for dealing with the ﬂood of data other than alert on only important things, aggregate alerts for machines classes.

9. DUPLICATE JOBS
Duplicate bits, wasted time, resources

We currently build a release twice - once in CI, once as a release job. This is ineﬃcient and makes releng a bottleneck to getting a release out the door. To ﬁx this -
implement release promotion! https://bugzilla.mozilla.org/show_bug.cgi?id=1118794. Same thing applies to nightly builds

Picture Creative Commons

https://upload.wikimedia.org/wikipedia/commons/c/c6/DNA_double_helix_45.PNG

By Jerome Walker,Dennis Myts (Own work) [Public domain], via Wikimedia Commons

10.
SCHEDULING
Adding a new platform or new suites of tests currently requires release engineering intervention. We want to make this more self-serve, and allow developers to add new
tests and platforms.

Solution: We are currently in the process of migrating to a new system for manages task queuing, scheduling, execution and provisioning of resources for our CI system.
This system is called taskcluster. This system will allow developers to schedule new tests in tree, and will make standing up new platforms much easier. It’s a micro
services architecture, jobs run in docker images that allows developers to have the same environment on their desktop, as the CI system runs.

http://docs.taskcluster.net/

Picture by hehaden - Creative Commons Attribution-NonCommercial 2.0 Generic (CC BY-NC 2.0)

https://www.ﬂickr.com/photos/hellie55/5083003751/sizes/l

CONCLUSION
• Caring for a large distributed system is like taking
care of a city’s water and sewage
• Increase throughput by constraining inputs based
on the outputs you want to optimize
• May need to migrate to something new while
keeping the existing system working
You need to identify system leaks and implement monitoring

For instance, in our case, we reduce test jobs to optimize end to end time

Buildbot->taskcluster

FURTHER READING
• AllYour Base 2014: LauraThomson, Director of Engineering, Cloud Services Engineering and
Operations, Mozilla – Many moving parts: monitoring complex systems: https://vimeo.com/album/
3108317/video/110088288
• Velocity Santa Clara 2015: Astrid Atkinson, Director Software Engineering, Google - Engineering for
the long game: Managing complexity in distributed systems. https://www.youtube.com/watch?
v=p0jGmgIrf_M
• Mountain West Ruby Conference, Paul Hinze’s, Hashicorp - Primitives of High Availability https://
speakerdeck.com/phinze/smoke-and-mirrors-the-primitives-of-high-availability
• Strange Loop 2015, Camille Fournier - Hopelessness and Conﬁdence in Distributed Systems Design
http://www.slideshare.net/CamilleFournier1/hopelessness-and-conﬁdence-in-distributed-systems-
design

Distributed Systems at Scale: Reducing the Fail

More Related Content

What's hot

Similar to Distributed Systems at Scale: Reducing the Fail

Recently uploaded

Distributed Systems at Scale: Reducing the Fail