DISTRIBUTED SYSTEMS AT SCALE: REDUCING THE FAILURES
Kim Moir, Mozilla, @kmoir
November 13, 2015
Water pipe

I often think of our continuous integration system as analogous to a municipal water system; some days, a sewage system. We control the infrastructure that we
provide, but it is constrained. If someone overloads the system with inputs, we will have problems.

Picture by wili_hybrid - Creative Commons Attribution-NonCommercial 2.0 Generic (CC BY-NC 2.0)

https://www.flickr.com/photos/wili/3958348428/sizes/l
I recently read a book called “Thinking in Systems”. It’s a generalized look at complex systems and how they work. It’s not specific to computer science.

Picture: https://upload.wikimedia.org/wikipedia/commons/b/bf/Slinky_rainbow.jpg Creative commons 2.0
–Donella H. Meadows, Thinking in Systems
“A system is a set of things…interconnected in such a
way that they produce their own pattern of behaviour
over time.”
This system may be impacted by outside forces.

The response to these forces is characteristic of the system itself.

That response is seldom simple in the real world.

The same inputs would produce different behaviour in a different system.
WHAT DO WE OPTIMIZE FOR?
One of the questions the book asks is, “What are we optimizing for?”
WHAT DO WE OPTIMIZE FOR?
• Budget
• Shipping
• End to end time (Developer productivity)
How much budget do we have available to spend on our continuous integration farm?

Can we ship a release to fix a 0-day security issue in less than 24 hours?

Are developers getting their test results quickly enough that they remain productive?
WHAT ARE THE
CONSTRAINTS?
Another question the book asks is: what are the constraints on the system?
WHAT ARE THE
CONSTRAINTS?
• Budget
• Time
Budget for in-house hardware pools and AWS bill

Time for us to optimize the system

Time for developers to wait for their results

I’m going to talk now about the pain points in this large distributed system, and how it can fail in spectacular fashion.
1. UNPREDICTABLE INPUT

Picture is a graph of monthly branch load. We have daily spikes as Mozillians across the world come online and start pushing code. The troughs are weekends.

This is a complex system, and release engineering does not control all the inputs. For example, someone may increase test load by 50% on one platform without increasing the hardware pool by a
corresponding amount.

Is someone abusing the try server? Pushes are not coalesced on try.

Occasionally someone will (re)trigger a large number of jobs on a single changeset. People who do this often, and with good reason, usually do so on weekends when there is less
contention for the infrastructure. If not, the pending counts can get very high, especially for in-house pools where we can’t burst capacity.

Solution: Implement smarter (semi-automatic) test selection. See cmanchester’s work: http://chmanchester.github.io/blog/2015/08/06/defining-semi-automatic-test-prioritization/ and bug https://bugzilla.mozilla.org/show_bug.cgi?id=1184405
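The idea behind semi-automatic test selection can be sketched roughly as follows. This is a minimal illustration, not Mozilla's actual implementation; the directory-to-suite mapping and function names are made up for the example:

```python
# Hypothetical sketch: map changed source directories to the test suites
# most likely to catch regressions, instead of running everything.
SUITE_MAP = {  # illustrative mapping, not the real one
    "dom/": ["mochitest"],
    "layout/": ["reftest"],
    "netwerk/": ["xpcshell"],
}

def suites_for_push(changed_files, default=("mochitest", "reftest", "xpcshell")):
    """Return the suites to schedule for a push, given its changed files."""
    selected = set()
    for f in changed_files:
        for prefix, suites in SUITE_MAP.items():
            if f.startswith(prefix):
                selected.update(suites)
    # Unknown paths fall back to the full set rather than running nothing.
    return sorted(selected) if selected else sorted(default)

print(suites_for_push(["layout/base/frame.cpp"]))  # only reftest
print(suites_for_push(["README"]))                 # falls back to all suites
```

The fallback matters: shedding load is only safe if unclassified changes still get full coverage.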
2. NO CANARY TESTING
Every night, we generate new AMIs from our Puppet configs for the Amazon instance types we use. These images are used to instantiate new instances on Amazon. We
have scripts to recycle the instances with old AMIs once the new ones become available. Which is great; however, we don’t have any canary testing for the new AMIs.

So we can have something happen like this

1) Someone releases a Puppet patch that passes tests

2) However, the AMIs it generates have a permission issue

3) Which prevents all new instances from starting the process that connects the test machines to their server

4) So we have thousands of instances up, burning money, not doing anything

5) It looks like there are plenty of machines, but pending counts continue to rise

Failure: All the AWS images are coming up but no builds are running. 

Solution: We need to implement canary testing for AMIs: a methodology for rolling out new AMIs in a tiered, trackable fashion, with automated testing and tiered roll-outs added to golden AMI generation (https://bugzilla.mozilla.org/show_bug.cgi?id=1146369).

Picture by ross_strachan https://www.flickr.com/photos/ross_strachan/6176512880/sizes/l Creative Commons 2.0
3. NOT ALL AUTO SCALING

Several works in progress on this front

Our infrastructure does not autoscale for some platforms, such as Mac tests. (We can’t run tests on Macs in a virtualized environment on non-Mac hardware due to Apple licensing
restrictions.) So we have racks of Mac minis. AWS doesn’t offer the licenses for the Windows versions we need to test. Also, we can’t run performance tests on cloud
instances because the results are not consistent.

We cannot easily move test machines between pools dynamically. The importance of a platform shifts over time, and we need to reallocate machines between testing pools; this
is mostly a manual process.

We decided to focus on Linux 64 perf testing, freeing up Linux 32 machines for use in Windows testing. This required code changes, imaging new machines, and trying to fix an
imaging process which had some bugs.

Solutions: Run Windows builds in AWS (now in production on limited branches). Cross compile Mac builds on Linux. Todo: add bug references. 

Still, Mac and Windows tests are a problem: we either need in-house capacity for peak load, which is expensive, or we don’t buy for peak load and deal with a
backlog of pending jobs.

Picture

"DB-Laaeks25804366678-7". Licensed under CC BY-SA 3.0 via Commons - https://commons.wikimedia.org/wiki/File:DB-Laaeks25804366678-7.JPG#/media/File:DB-
4. SELF-INFLICTED DENIAL OF
SERVICE
Engineering for the long game: Managing complexity in distributed systems. Astrid Atkinson https://www.youtube.com/watch?v=p0jGmgIrf_M 

18:47 “Google itself is the biggest source of denial of service attacks”

At Mozilla, we are no different.

We retry jobs when they fail for infrastructure reasons. This is okay, because perhaps a machine is wonky and needs to be removed from the pool,
and the next time the job runs, it will run on a machine that is in a clean state.

Human error. What has changed? Permissions, network, and DNS. Changes to our production infrastructure trigger automatic tests on all commits. Example: IT redirected a
server name to a new host where we didn’t have ssh keys deployed. We understood the change as a redirect to a new cname, not a new host. Jobs spiked because they
kept retrying to fetch a zip from that host.

Solution: better monitoring of retries, better communication, and a check that compares retry spikes against regular jobs and alerts.
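The retry-spike check could look something like this minimal sketch; the 20% threshold and job shape are assumptions for illustration, not our actual monitoring config:

```python
# Hypothetical sketch: alert on the ratio of retried jobs to total jobs
# over a window, rather than on individual retries. Individual retries
# are normal (wonky machines); a high ratio suggests a systemic failure
# like the ssh-key incident above.
def retry_spike(jobs, threshold=0.2):
    """jobs is a list of dicts with a boolean 'retried' field."""
    if not jobs:
        return False
    retried = sum(1 for j in jobs if j["retried"])
    return retried / len(jobs) > threshold

window = [{"retried": True}] * 3 + [{"retried": False}] * 7
print(retry_spike(window))  # True: 30% of the window was retried
```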

Picture http://www.flickr.com/photos/ervins_strauhmanis/9554405492/sizes/l/
5. TOO MUCH TEST LOAD
We run builds on commit. There is some coalescing on branches. Since we build and test for many platforms, we can run up to 500 jobs (all platforms, excluding talos)
on a single commit.

That is too many jobs for our infrastructure to handle, leading to high pending counts. How can we intelligently shed load?

Solution: Do we really need to run every test on every commit, given that many of them don’t historically reveal problems? We have a project called SETA which analyzes
historical test data, and we have implemented changes to our scheduler to accommodate this. Basically, we can reduce the frequency of specified test runs on a per-platform,
per-branch basis. This allows us to shed test load and increase the throughput of the system.
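The scheduling change can be sketched as below. This is an illustrative toy, not SETA's real code; the job fields and the every-fifth-push interval are invented for the example:

```python
# Hypothetical sketch of SETA-style load shedding: jobs classified as
# low value (they rarely catch regressions, per historical data) run
# only every Nth push; high-value jobs run on every push.
def should_schedule(job, push_number, low_value_interval=5):
    """Decide whether to schedule this job for this push."""
    if job["high_value"]:
        return True
    return push_number % low_value_interval == 0

job = {"name": "rarely-failing-suite", "high_value": False}
print(should_schedule(job, push_number=3))   # False: skipped this push
print(should_schedule(job, push_number=10))  # True: periodic full coverage
```

Skipped jobs still run periodically, so a regression a low-value suite would catch is delayed rather than missed entirely.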

Picture: https://www.flickr.com/photos/encouragement/14759554777/
6. SYSTEM RESOURCE LEAKS
I often think that managing a large scale distributed system is like being a water or sanitation engineer for a city. Ironically, LinkedIn thinks so too and advertises these
jobs to me. Where are the leaks happening? Where they would look for wasted water, we look for wasted computing resources.

We recently had a problem where our Windows test pending counts spiked quite drastically. We initially thought this was just due to some more test jobs being added to
the platform in parallel while the old corresponding tests were not disabled. In fact, the major root cause of the problem was that the additional tests caused additional
overhead on the server responsible for managing jobs on the test machines. Basically, the time between a test machine finishing a task and the server responding was
getting very long, which, multiplied by hundreds of machines and thousands of jobs, leads to a long backlog.



Solution: 

This issue was resolved by adding additional servers to the pool that services these test machines, upgrading the RAM on each of them, and increasing the interval at
which they are restarted.

We run many tests in parallel to reduce the end-to-end time that a build and its associated tests take. It is hard to balance startup time against time spent running actual
tests: running more tests in parallel means more startup overhead.

Solution: Implement more monitoring each time a system resource leak causes an outage

Test chunking, including chunking by run time.

Picture by Korona Lacasse - Creative Commons 2.0 Attribution 2.0 Generic https://www.flickr.com/photos/korona4reel/14107877324/sizes/l
Another system resource leak was plugged by the use of a tool called runner, a project that manages starting tasks in a defined order (https://github.com/mozilla/build-runner). Basically, it ensures that the machines are in a sane state to run a job. If they are in a good state, we don’t reboot them as often, which increases
the overall throughput of the system.
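The core of a runner-style loop is small enough to sketch. This is a simplified illustration of the "tasks in a defined order" idea, not build-runner's actual code; task names are made up:

```python
# Minimal sketch of a runner-style pre-job loop: execute health/setup
# tasks in a defined order; the machine takes a job only if all succeed,
# otherwise it is flagged (e.g. for reboot or re-imaging) instead of
# silently picking up work in a bad state.
def prepare_machine(tasks):
    """tasks is an ordered list of callables returning True on success."""
    for task in tasks:
        if not task():
            return False  # stop at the first failure; machine is not sane
    return True

ran = []
ok = prepare_machine([
    lambda: ran.append("check_disk") or True,   # illustrative task names
    lambda: ran.append("clean_workdir") or True,
])
print(ok, ran)  # True ['check_disk', 'clean_workdir']
```

Because healthy machines pass every check, they skip the reboot they would otherwise get between jobs, which is where the throughput win comes from.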
7. TIME TO RECOVER FROM
FAILURE
Our CI system is not a bank’s transaction system or a large commerce website, but it still has most of the characteristics of a highly available system. We have some
issues regarding failover, and we still have single points of failure in our system.

A few weeks ago, I watched a talk by Paul Hinze of HashiCorp on the primitives of High Availability. 

Primitives of high availability talk 42:20

https://speakerdeck.com/phinze/smoke-and-mirrors-the-primitives-of-high-availability

He states that it is inevitable that the system will fail, but the real measure is how quickly we can recover from it.

In terms of managing failure, we have managed to decouple some components in the way that we manage our code repositories. If, for instance, bad code landed on
mozilla-inbound, which causes all the tests to fail, our code sheriffs can close this tree, and leave other repositories open.

However, we still have many things that are a single point of failure, for instance the hardware in our data centre or the associated network. Given that jobs automatically
retry if they fail for infrastructure reasons, we can bring the system back up without a lot of intervention.

Solution: distribute failure via the branching model and closing trees

Picture by Mike Green - Creative Commons 2.0 Attribution-NonCommercial 2.0 Generic 

https://www.flickr.com/photos/30751204@N06/7328288188/sizes/l
8. MONITORING
nagios which alerts to irc

papertrail

email alerts

dashboard

treeherder

new relic (doesn’t really apply to releng)

Picture is Mystery by ©Stuart Richards, Creative Commons by-nc-sa 2.0

https://www.flickr.com/photos/left-hand/6883500405/
We have a dedicated channel for alerts on the state of our build farm. Nagios alerts send a message to the #buildduty channel. Due to the large number of devices on
our build farm, most of these alerts are threshold alerts. We don’t care if a single machine goes down; it will be automatically rebooted, and if this doesn’t work, a bug will
be opened for a person to look at it. However, we do care if 200 of them suddenly stop taking jobs. You can see from this page that we have threshold alerts for the
number of pending jobs. If there is a sustained spike, we need to look at it in further detail.
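A sustained-threshold check of this kind can be sketched as follows; the limit, window, and sample shape are illustrative assumptions, not our actual Nagios configuration:

```python
# Hypothetical sketch of a threshold alert over a fleet metric (e.g.
# pending job counts sampled periodically). A single high sample is
# noise; alert only when the last few consecutive samples all exceed
# the limit, i.e. the spike is sustained.
def threshold_alert(samples, limit, sustained=3):
    """Return True if the last `sustained` samples all exceed `limit`."""
    recent = samples[-sustained:]
    return len(recent) == sustained and all(s > limit for s in recent)

pending = [100, 2500, 2600, 2700]
print(threshold_alert(pending, limit=2000))  # True: three high samples in a row
```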

We also have alerts for things like checking that the golden images we create each night for our Amazon instances are actually completing, and that the process that
kills old Amazon instances when their capacity is not needed is indeed running. We use papertrail as a log aggregator that allows us to quickly search our logs for issues.
We use graphite for analytics that allow us to look at long-term trends and spikes. For instance, this graph shows our overall infrastructure time: green is EC2, blue is in-house hardware.

Problem: All of this data is sometimes overwhelming. Every time we have an outage around something we haven’t previously monitored, we add another alert or
additional monitoring. I don’t really know the solution for dealing with the flood of data, other than alerting only on important things and aggregating alerts for classes of machines.
9. DUPLICATE JOBS
Duplicate bits, wasted time, resources

We currently build a release twice: once in CI, and once as a release job. This is inefficient and makes releng a bottleneck to getting a release out the door. To fix this,
implement release promotion! https://bugzilla.mozilla.org/show_bug.cgi?id=1118794. The same thing applies to nightly builds.

Picture Creative Commons

https://upload.wikimedia.org/wikipedia/commons/c/c6/DNA_double_helix_45.PNG

By Jerome Walker,Dennis Myts (Own work) [Public domain], via Wikimedia Commons
10. SCHEDULING
Adding a new platform or new suites of tests currently requires release engineering intervention. We want to make this more self-serve, and allow developers to add new
tests and platforms. 

Solution: We are currently migrating to a new system that manages task queuing, scheduling, execution, and provisioning of resources for our CI system.
This system is called taskcluster. It will allow developers to schedule new tests in tree, and will make standing up new platforms much easier. It’s a micro-services
architecture; jobs run in docker images, which lets developers have the same environment on their desktops as the CI system runs.

http://docs.taskcluster.net/

Picture by hehaden - Creative Commons Attribution-NonCommercial 2.0 Generic (CC BY-NC 2.0)

https://www.flickr.com/photos/hellie55/5083003751/sizes/l
CONCLUSION
• Caring for a large distributed system is like taking
care of a city’s water and sewage
• Increase throughput by constraining inputs based
on the outputs you want to optimize
• May need to migrate to something new while
keeping the existing system working
You need to identify system leaks and implement monitoring.

For instance, in our case, we reduce test jobs to optimize end-to-end time.

Buildbot->taskcluster
FURTHER READING
• All Your Base 2014: Laura Thomson, Director of Engineering, Cloud Services Engineering and Operations, Mozilla – Many Moving Parts: Monitoring Complex Systems: https://vimeo.com/album/3108317/video/110088288
• Velocity Santa Clara 2015: Astrid Atkinson, Director of Software Engineering, Google – Engineering for the Long Game: Managing Complexity in Distributed Systems: https://www.youtube.com/watch?v=p0jGmgIrf_M
• Mountain West Ruby Conference: Paul Hinze, HashiCorp – Primitives of High Availability: https://speakerdeck.com/phinze/smoke-and-mirrors-the-primitives-of-high-availability
• Strange Loop 2015: Camille Fournier – Hopelessness and Confidence in Distributed Systems Design: http://www.slideshare.net/CamilleFournier1/hopelessness-and-confidence-in-distributed-systems-design

AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 

Recently uploaded (20)

Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 

Distributed Systems at Scale: Reducing the Fail

  • 1. DISTRIBUTED SYSTEMS AT SCALE: REDUCINGTHE FAIL Kim Moir, Mozilla, @kmoir URES, November 13, 2015
  • 2. Water pipe I often think of our continuous integration system as analogous to a municipal water system. Some days, a sewage system. We control the infrastructure that we provide but it is constrained. If someone overloads the system with inputs, we will have problems. Picture by wili_hybrid - Creative Commons Attribution-NonCommercial 2.0 Generic (CC BY-NC 2.0) https://www.flickr.com/photos/wili/3958348428/sizes/l
  • 3. I recently read a book called “Thinking in Systems”. It’s a generalized look at complex systems and how they work. It’s not specific to computer science. Picture: https://upload.wikimedia.org/wikipedia/commons/b/bf/Slinky_rainbow.jpg Creative commons 2.0
  • 4. –Donatella H. Meadows, Thinking in Systems “A system is a set of things…interconnected in such a way that they produce their own pattern of behaviour over time.” This system may be impacted by outside forces The response to these forces is characteristic of of the system itself That response is seldom simple in the real world These same inputs would result in a different behaviour a different system
  • 5. WHAT DO WE OPTIMIZE FOR? One of the questions the book asks is “What are we optimizing for?
  • 6. WHAT DO WE OPTIMIZE FOR? • Budget • Shipping • End to end time (Developer productivity) How much budget do we have available to spend our our continuous integration farm? Can we ship a release to fix a 0-day security issue in less than 24 hours? Are developers getting their test results quickly that they remain productive?
  • 7. WHAT ARETHE CONSTRAINTS? Another question the books asks is, what are the constraints the system?
  • 8. WHAT ARETHE CONSTRAINTS? • Budget • Time Budget for in-house hardware pools and AWS bill Time for us to optimize the system Time for developers to wait for their results I’m going to talk now about how the pain points in this large distributed system. How it can fail in a spectacular fashion.
  • 9. 1. UNPREDICTABLE INPUT 1 Picture is a graph of monthly branch load. We have daily spikes as Mozillians across the world come online and start pushing code. The troughs are weekends. In a complex system, release engineering does not control all the inputs, e.g. a decision to increase test load by 50% on one platform without increasing the hardware pool by a corresponding amount. Is someone abusing the try server? Pushes are not coalesced on try. Occasionally someone will (re)trigger a large number of jobs on a single changeset. People who do this often and with good reason usually do so on weekends, when there is less contention for the infrastructure. If not, the pending counts can get very high, especially for in-house pools where we can’t burst capacity. Solution: Implement smarter (semi-automatic) test selection. cmanchester’s work: http://chmanchester.github.io/blog/2015/08/06/defining-semi-automatic-test-prioritization/ Bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1184405
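The core idea of semi-automatic test selection can be sketched as a mapping from touched source paths to the suites worth scheduling. This is a minimal illustration, not Mozilla's actual implementation; the path prefixes and suite names below are invented for the example.

```python
# Hypothetical sketch: pick test suites based on which files a push touches,
# rather than scheduling every suite on every commit.
SUITE_MAP = {
    "dom/": {"mochitest-dom", "web-platform-tests"},
    "js/": {"jsreftest", "jittest"},
    "toolkit/": {"mochitest-browser", "xpcshell"},
}
DEFAULT_SUITES = {"build", "lint"}  # always run, regardless of the diff

def select_suites(changed_files):
    """Return the set of suites to schedule for a push."""
    suites = set(DEFAULT_SUITES)
    for path in changed_files:
        for prefix, mapped in SUITE_MAP.items():
            if path.startswith(prefix):
                suites |= mapped
    return suites

print(sorted(select_suites(["js/src/jit/Ion.cpp", "README.md"])))
# → ['build', 'jittest', 'jsreftest', 'lint']
```

A real system (like the prioritization work linked above) would derive the mapping from historical failure data rather than hand-maintain it.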
  • 10. 2. NO CANARY TESTING Every night, we generate new AMIs from our Puppet configs for the Amazon instance types we use. These images are used to instantiate new instances on Amazon. We have scripts to recycle the instances with old AMIs after the new ones become available. Which is great; however, we don’t have any canary testing for the new AMIs. So something like this can happen: 1) Someone releases a Puppet patch that passes tests 2) However, the AMIs it generates have a permission issue 3) Which prevents all new instances from starting the process that connects the test instances to their server 4) So we have thousands of instances up, burning money, that aren’t doing anything 5) It looks like there are plenty of machines, yet pending counts continue to rise Failure: All the AWS images are coming up but no builds are running. Solution: We need to implement canary testing for AMIs (Implement a methodology for rolling out new AMIs in a tiered, trackable fashion: https://bugzilla.mozilla.org/show_bug.cgi?id=1146369) (Add automated testing and tiered roll-outs to golden AMI generation) Picture by ross_strachan https://www.flickr.com/photos/ross_strachan/6176512880/sizes/l Creative Commons 2.0
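A tiered rollout of the kind the bug describes might look like the sketch below: boot a small canary group from the new image, verify the canaries actually connect and take a job, and only then recycle the rest of the pool. `launch_instances` and `instance_took_job` are hypothetical stand-ins for real cloud API calls; none of this reflects Mozilla's actual tooling.

```python
import time

# Hypothetical sketch of a tiered (canary) AMI rollout. The caller supplies
# launch_instances(ami, count) and instance_took_job(instance) callbacks that
# wrap the real cloud/scheduler APIs.
def canary_rollout(new_ami, pool_size, launch_instances, instance_took_job,
                   canary_count=5, wait_seconds=600, poll=30):
    canaries = launch_instances(new_ami, canary_count)
    deadline = time.time() + wait_seconds
    while time.time() < deadline:
        if all(instance_took_job(i) for i in canaries):
            # Canaries are healthy: safe to recycle the rest of the pool.
            launch_instances(new_ami, pool_size - canary_count)
            return True
        time.sleep(poll)
    # Canaries never took work (e.g. a permission bug in the image):
    # keep the old AMI in service and page a human instead.
    return False
```

The key property is that an image with the permission bug described above fails loudly at the canary stage, instead of silently replacing the whole pool with instances that burn money doing nothing.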
  • 11. 3. NOT ALL AUTO SCALING 3 Several works in progress on this front. Our infrastructure does not autoscale for some platforms, e.g. Mac tests. (We can’t run Mac tests in a virtualized environment on non-Mac hardware due to Apple licensing restrictions.) So we have racks of Mac minis. AWS doesn’t offer the licenses for the Windows versions we need to test. Also, we can’t run performance tests on cloud instances because the results are not consistent. We cannot easily move test machines between pools dynamically. The importance of a platform shifts over time, and we need to reallocate machines between testing pools; this is largely a manual process. We decided to focus on Linux 64 perf testing, freeing up Linux 32 machines for use in Windows testing. This required code changes, imaging new machines, and fixing an imaging process which had some bugs. Solutions: Run Windows builds in AWS (now in production on limited branches). Cross-compile Mac builds on Linux. Todo: add bug references. Still, Mac and Windows tests are a problem because we either need in-house capacity for peak load, which is expensive, or we don’t buy for peak load and deal with a backlog of pending jobs. Picture "DB-Laaeks25804366678-7". Licensed under CC BY-SA 3.0 via Commons - https://commons.wikimedia.org/wiki/File:DB-Laaeks25804366678-7.JPG#/media/File:DB-
  • 12. 4. SELF-INFLICTED DENIAL OF SERVICE Engineering for the long game: Managing complexity in distributed systems. Astrid Atkinson https://www.youtube.com/watch?v=p0jGmgIrf_M 18:47 “Google itself is the biggest source of denial of service attacks” At Mozilla, we are no different. We retry jobs when they fail for infrastructure reasons. Which is okay, because perhaps there is a machine that is wonky and needs to be removed from the pool, and the next time the job runs it will land on a machine in a clean state. Human error. What has changed? Permissions, network and DNS. Commits are automatically applied to our production infrastructure once tests pass. Example: IT redirected a server name to a new host where we didn’t have ssh keys deployed. We understood the change as a redirect to a new cname, not a new host. Jobs spiked as they retried fetching a zip from that host. Solution: better monitoring of retries, communication, checking for retry spikes vs regular jobs, and alerting. Picture http://www.flickr.com/photos/ervins_strauhmanis/9554405492/sizes/l/
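The "check for retry spikes vs regular jobs" idea can be sketched as a ratio alert: a few retries are normal background noise, but retries making up a large share of recent jobs suggests something systemic (like the ssh-key example above). The thresholds here are invented for illustration.

```python
# Hypothetical sketch: flag a self-inflicted denial of service by comparing
# infrastructure retries against total job volume over a recent window.
def retry_spike(total_jobs, retried_jobs, max_ratio=0.05, min_jobs=100):
    """Return True when retries make up an abnormal share of recent jobs."""
    if total_jobs < min_jobs:
        return False  # too little data to distinguish a spike from noise
    return retried_jobs / total_jobs > max_ratio

assert retry_spike(1000, 20) is False   # 2% retries: normal background noise
assert retry_spike(1000, 120) is True   # 12% retries: something systemic broke
```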
  • 13. 5. TOO MUCH TEST LOAD We run builds on commit. There is some coalescing on branches. Since we build and test for many platforms, we can run up to 500 jobs (all platforms, excluding Talos) on a single commit. Too many jobs for our infrastructure to handle means high pending counts. How can we intelligently shed load? Solution: Do we really need to run every test on every commit, given that many of them don’t historically reveal problems? We have a project called SETA which analyzes historical test data, and we have implemented changes to our scheduler to accommodate this. Basically, we can reduce the frequency of specified test runs on a per-platform, per-branch basis. This allows us to shed test load and increase the throughput of the system. Picture: https://www.flickr.com/photos/encouragement/14759554777/
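SETA-style load shedding reduces to a small scheduling rule: suites that rarely catch regressions run only on every Nth push for a given platform and branch, while everything else still runs on every commit. This is a minimal sketch of the idea, with an invented frequency table, not SETA's actual data or API.

```python
# Hypothetical sketch of SETA-style per-platform, per-branch load shedding.
# Suites absent from the table keep the default frequency of every push.
LOW_VALUE = {
    ("linux64", "mozilla-inbound", "reftest-gpu"): 5,       # every 5th push
    ("win32", "mozilla-inbound", "mochitest-chrome"): 10,   # every 10th push
}

def should_run(platform, branch, suite, push_count):
    """Decide whether this suite runs on this push."""
    every_n = LOW_VALUE.get((platform, branch, suite), 1)
    return push_count % every_n == 0
```

The frequency table is the part SETA computes automatically, by analyzing which suites historically reported failures that no other suite caught.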
  • 14. 6. SYSTEM RESOURCE LEAKS I often think that managing a large-scale distributed system is like being a water or sanitation engineer for a city. Ironically, LinkedIn thinks so too and advertises these jobs to me. Where are the leaks happening? Where they would look for wasted water, we look for wasted computing resources. We recently had a problem where our Windows test pending counts spiked quite drastically. We initially thought this was just due to some more test jobs being added to the platform in parallel while the old corresponding tests were not disabled. In fact, the major root cause was that the additional tests caused additional overhead on the server responsible for managing jobs on the test machines. Basically, the time between a test machine finishing a task and the server responding was getting very long, which, multiplied by hundreds of machines and thousands of jobs, leads to a long backlog. Solution: This issue was resolved by adding additional servers to the pool that services these test machines, upgrading the RAM on each of them, and increasing the interval at which they are restarted. We run many tests in parallel to reduce the end-to-end time that a build and associated tests take. It is hard to balance start-up time against time spent running actual tests: running more tests in parallel has to be weighed against the added start-up overhead. Solution: Implement more monitoring each time a system resource leak causes an outage. Test chunking: chunking by run time. Picture by Korona Lacasse - Creative Commons 2.0 Attribution 2.0 Generic https://www.flickr.com/photos/korona4reel/14107877324/sizes/l
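"Chunking by run time" means splitting a test manifest by historical duration rather than by test count, so every chunk finishes at roughly the same wall-clock time. A classic way to approximate this is a greedy longest-processing-time assignment; the sketch below illustrates the technique under that assumption, and is not Mozilla's actual chunking code.

```python
import heapq

# Hypothetical sketch of chunking by run time: greedily assign the longest
# tests to whichever chunk currently has the least total runtime.
def chunk_by_runtime(test_runtimes, n_chunks):
    """test_runtimes: {test_name: seconds}. Returns [(total_seconds, [tests])]."""
    # Heap of (total_runtime, chunk_index, tests); the index breaks ties.
    heap = [(0.0, i, []) for i in range(n_chunks)]
    heapq.heapify(heap)
    for name, secs in sorted(test_runtimes.items(), key=lambda kv: -kv[1]):
        total, i, tests = heapq.heappop(heap)  # lightest chunk so far
        tests.append(name)
        heapq.heappush(heap, (total + secs, i, tests))
    return [(total, tests) for total, _, tests in sorted(heap, key=lambda c: c[1])]
```

With equal-count chunking, one chunk can end up holding all the slow tests and dominate the end-to-end time; balancing by runtime removes that long pole.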
  • 15. Another system resource leak was plugged by the use of a tool called runner. Runner is a project that manages starting tasks in a defined order: https://github.com/mozilla/build-runner. Basically, it ensures that the machines are in a sane state to run a job. If they are in a good state, we don’t reboot them as often, which increases the overall throughput of the system.
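The essence of what a tool like runner does can be shown in a few lines: execute pre-flight tasks in a fixed order, retry flaky ones a bounded number of times, and only declare the machine fit for work if everything succeeds. This is a simplified sketch of the pattern, not the actual build-runner code.

```python
# Minimal sketch of a runner-style pre-flight check: tasks is an ordered
# list of (name, callable-returning-bool). On success the machine may take
# jobs; on persistent failure it should be rebooted or repaired instead.
def run_tasks(tasks, max_retries=2):
    for name, task in tasks:
        for _attempt in range(max_retries + 1):
            if task():
                break  # task succeeded, move on to the next one
        else:
            # Task never succeeded: machine is not in a sane state.
            return False, name
    return True, None
```

Because a machine that passes every check is known-good, it can go straight into the pool without a precautionary reboot, which is where the throughput win comes from.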
  • 16. 7. TIME TO RECOVER FROM FAILURE Our CI system is not a bank’s transaction system or a large commerce website. But still, it meets most of the characteristics of a highly available system. We have some issues regarding failover; we still have single points of failure in our system. A few weeks ago, I watched a talk by Paul Hinze of HashiCorp on the primitives of high availability: 42:20 https://speakerdeck.com/phinze/smoke-and-mirrors-the-primitives-of-high-availability He states that it is inevitable that the system will fail; the real measure is how quickly we can recover from it. In terms of managing failure, we have managed to decouple some components in the way that we manage our code repositories. If, for instance, bad code lands on mozilla-inbound, causing all the tests to fail, our code sheriffs can close this tree and leave other repositories open. However, we still have many things that are a single point of failure, for instance, the hardware in our data centre or the associated network. Given that jobs automatically retry if they fail for infrastructure reasons, this allows us to bring the system up without a lot of intervention. Solution: Distributed failure - branching model and closing trees Picture by Mike Green - Creative Commons 2.0 Attribution-NonCommercial 2.0 Generic https://www.flickr.com/photos/30751204@N06/7328288188/sizes/l
  • 17. 8. MONITORING Nagios, which alerts to IRC; Papertrail; email alerts; dashboards; Treeherder; New Relic (doesn’t really apply to releng). Picture is Mystery by ©Stuart Richards, Creative Commons by-nc-sa 2.0 https://www.flickr.com/photos/left-hand/6883500405/
  • 18. We have a dedicated channel for alerts on the state of our build farm. Nagios alerts send a message to the #buildduty channel. Due to the large number of devices on our build farm, most of these alerts are threshold alerts. We don’t care if a single machine goes down: it will be automatically rebooted, and if that doesn’t work, a bug will be opened for a person to look at it. However, we do care if 200 of them suddenly stop taking jobs. You can see from this page that we have threshold alerts for the number of pending jobs. If there is a sustained spike, we need to look at it in further detail. We also have alerts for things like checking that the golden images we create each night for our Amazon instances are actually completing, and that the process that kills old Amazon instances when their capacity is not needed is indeed running. We use Papertrail as a log aggregator, which allows us to quickly search our logs for issues.
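A threshold alert of the kind described above only fires on a sustained condition, not a single noisy poll. The sketch below illustrates one way to express that, assuming invented threshold values; it is not the actual Nagios configuration.

```python
from collections import deque

# Hypothetical sketch of a sustained-threshold alert on pending job counts:
# ignore a momentary blip, fire only when several consecutive polls all
# exceed the threshold.
class PendingAlert:
    def __init__(self, threshold, sustained_checks=3):
        self.threshold = threshold
        self.history = deque(maxlen=sustained_checks)

    def observe(self, pending_count):
        """Record one poll; return True when the alert should fire."""
        self.history.append(pending_count)
        return (len(self.history) == self.history.maxlen
                and all(c > self.threshold for c in self.history))
```

The same shape works for the other threshold checks mentioned (golden image completion, instance reaping): alert on the aggregate trend, not on individual machines.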
  • 19. We use Graphite for analytics that allow us to look at long-term trends and spikes. For instance, this graph looks at our overall infrastructure time: green is EC2, blue is in-house hardware. Problem: All of this data is sometimes overwhelming. Every time we have an outage around something we haven’t monitored previously, we add another alert or additional monitoring. I don’t really know the solution for dealing with the flood of data, other than alerting only on important things and aggregating alerts by machine class.
  • 20. 9. DUPLICATE JOBS Duplicate bits, wasted time and resources. We currently build a release twice: once in CI, once as a release job. This is inefficient and makes releng a bottleneck to getting a release out the door. To fix this: implement release promotion! https://bugzilla.mozilla.org/show_bug.cgi?id=1118794. The same thing applies to nightly builds. Picture Creative Commons https://upload.wikimedia.org/wikipedia/commons/c/c6/DNA_double_helix_45.PNG By Jerome Walker, Dennis Myts (Own work) [Public domain], via Wikimedia Commons
  • 21. 10. SCHEDULING Adding a new platform or new suites of tests currently requires release engineering intervention. We want to make this more self-serve and allow developers to add new tests and platforms. Solution: We are currently in the process of migrating to a new system that manages task queuing, scheduling, execution and provisioning of resources for our CI system. This system is called Taskcluster. It will allow developers to schedule new tests in-tree, and will make standing up new platforms much easier. It’s a microservices architecture; jobs run in Docker images, which allows developers to have the same environment on their desktop as the CI system runs. http://docs.taskcluster.net/ Picture by hehaden - Creative Commons Attribution-NonCommercial 2.0 Generic (CC BY-NC 2.0) https://www.flickr.com/photos/hellie55/5083003751/sizes/l
  • 22. CONCLUSION • Caring for a large distributed system is like taking care of a city’s water and sewage • Increase throughput by constraining inputs based on the outputs you want to optimize • May need to migrate to something new while keeping the existing system working You need to identify system leaks and implement monitoring. For instance, in our case, we reduce test jobs to optimize end-to-end time. Buildbot -> Taskcluster
  • 23. FURTHER READING • All Your Base 2014: Laura Thomson, Director of Engineering, Cloud Services Engineering and Operations, Mozilla - Many moving parts: monitoring complex systems: https://vimeo.com/album/3108317/video/110088288 • Velocity Santa Clara 2015: Astrid Atkinson, Director Software Engineering, Google - Engineering for the long game: Managing complexity in distributed systems. https://www.youtube.com/watch?v=p0jGmgIrf_M • Mountain West Ruby Conference, Paul Hinze, HashiCorp - Primitives of High Availability https://speakerdeck.com/phinze/smoke-and-mirrors-the-primitives-of-high-availability • Strange Loop 2015, Camille Fournier - Hopelessness and Confidence in Distributed Systems Design http://www.slideshare.net/CamilleFournier1/hopelessness-and-confidence-in-distributed-systems-design