Distributed Systems at Scale: Reducing the Fail

This talk looks at the major problems with Mozilla's continuous integration farm and the plans we have to fix these issues. This talk was given at the USENIX Release Engineering Summit in Washington DC on November 13, 2015.


  1. DISTRIBUTED SYSTEMS AT SCALE: REDUCING THE FAIL Kim Moir, Mozilla, @kmoir, URES, November 13, 2015
  2. Water pipe I often think of our continuous integration system as analogous to a municipal water system. Some days, a sewage system. We control the infrastructure that we provide but it is constrained. If someone overloads the system with inputs, we will have problems. Picture by wili_hybrid - Creative Commons Attribution-NonCommercial 2.0 Generic (CC BY-NC 2.0) https://www.flickr.com/photos/wili/3958348428/sizes/l
  3. I recently read a book called “Thinking in Systems”. It’s a generalized look at complex systems and how they work. It’s not specific to computer science. Picture: https://upload.wikimedia.org/wikipedia/commons/b/bf/Slinky_rainbow.jpg Creative commons 2.0
  4. –Donella H. Meadows, Thinking in Systems. “A system is a set of things…interconnected in such a way that they produce their own pattern of behaviour over time.” This system may be impacted by outside forces. The response to those forces is characteristic of the system itself, and that response is seldom simple in the real world. The same inputs would produce different behaviour in a different system.
  5. WHAT DO WE OPTIMIZE FOR? One of the questions the book asks is “What are we optimizing for?”
  6. WHAT DO WE OPTIMIZE FOR? • Budget • Shipping • End to end time (Developer productivity) How much budget do we have available to spend on our continuous integration farm? Can we ship a release to fix a 0-day security issue in less than 24 hours? Are developers getting their test results quickly enough that they remain productive?
  7. WHAT ARE THE CONSTRAINTS? Another question the book asks is, what are the constraints of the system?
  8. WHAT ARE THE CONSTRAINTS? • Budget • Time Budget for in-house hardware pools and the AWS bill. Time for us to optimize the system. Time for developers to wait for their results. I’m going to talk now about the pain points in this large distributed system, and how it can fail in spectacular fashion.
  9. 1. UNPREDICTABLE INPUT Picture is a graph of monthly branch load. We have daily spikes as Mozillians across the world come online and start pushing code. The troughs are weekends. In a complex system, release engineering does not control all the inputs, e.g. let’s increase test load by 50% on one platform but not increase the hardware pool by a corresponding amount. Is someone abusing the try server? Pushes are not coalesced on try. Occasionally someone will (re)trigger a large number of jobs on a single changeset. People who do this often and with good reason usually do so on weekends when there is less contention for the infrastructure. If not, the pending counts can get very high, especially for in-house pools where we can’t burst capacity. Solution: Implementing smarter (semi-automatic) test selection. cmanchester’s work: http://chmanchester.github.io/blog/2015/08/06/defining-semi-automatic-test-prioritization/ Bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1184405
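Coalescing is worth a concrete illustration. A minimal sketch in Python (hypothetical field names, not our actual scheduler code): pending build requests for the same branch and builder collapse to the newest push, while try pushes are never coalesced.

```python
def coalesce(pending):
    """Collapse pending build requests so that, per (branch, builder),
    only the newest push runs. Try pushes are never coalesced."""
    newest = {}
    for req in pending:
        if req["branch"] == "try":
            # every try push keeps its own request
            newest[("try", req["builder"], req["push_id"])] = req
            continue
        key = (req["branch"], req["builder"])
        if key not in newest or req["push_id"] > newest[key]["push_id"]:
            newest[key] = req
    return list(newest.values())
```

Under heavy load, older requests are simply absorbed into the newest one, which is part of why try, where this is disabled, is where pending counts spike.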
  10. 2. NO CANARY TESTING Every night, we generate new AMIs from our Puppet configs for the Amazon instance types we use. These images are used to instantiate new instances on Amazon. We have scripts to recycle the instances with old AMIs after the new ones become available. Which is great; however, we don’t have any canary testing for the new AMIs. So we can have something happen like this: 1) Someone releases a Puppet patch that passes tests 2) However, the AMIs it generates have a permission issue 3) Which prevents all new instances from starting the process that connects the test instances to their server 4) So we have thousands of instances up burning money that aren’t doing anything 5) It looks like there are plenty of machines, but pending counts continue to rise. Failure: All the AWS images are coming up but no builds are running. Solution: We need to implement canary testing for AMIs (Implement a methodology for rolling out new AMIs in a tiered, trackable fashion: https://bugzilla.mozilla.org/show_bug.cgi?id=1146369) (Add automated testing and tiered roll-outs to golden AMI generation). Picture by ross_strachan https://www.flickr.com/photos/ross_strachan/6176512880/sizes/l Creative commons 2.0
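The tiered, trackable rollout the bug asks for can be sketched as follows (a hypothetical interface; `launch` and `is_healthy` stand in for the real AWS calls and the “did it connect and take a job?” check): each tier only proceeds if every canary in the previous batch is healthy.

```python
def tiered_rollout(new_ami, fleet_size, launch, is_healthy,
                   tiers=(0.05, 0.25, 1.0)):
    """Launch `new_ami` across growing fractions of the fleet; abort the
    rollout if any canary instance fails its health check (e.g. never
    connects to its master to take a job)."""
    launched, done = [], 0
    for fraction in tiers:
        target = int(fleet_size * fraction)
        batch = launch(new_ami, target - done)
        if not all(is_healthy(instance) for instance in batch):
            raise RuntimeError("canary failed for %s, aborting rollout" % new_ami)
        launched.extend(batch)
        done = target
    return launched
```

With a scheme like this, the permission-issue scenario above burns a handful of canary instances instead of thousands.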
  11. 3. NOT ALL AUTO SCALING Several works in progress on this front. Our infrastructure does not autoscale for some platforms. Mac tests: we can’t run tests on Macs in a virtualized environment on non-Mac hardware due to Apple licensing restrictions, so we have racks of Mac minis. AWS doesn’t allow the licenses for the Windows versions we need to test. Also, we can’t run performance tests on cloud instances because the results are not consistent. We cannot easily move test machines between pools dynamically. The importance of a platform shifts over time, so we need to reallocate machines between testing pools, and this is largely a manual process. We decided to focus on Linux 64 perf testing, freeing up Linux 32 machines for use in Windows testing. This required code changes, imaging new machines, and trying to fix the imaging process, which had some bugs. Solutions: Run Windows builds in AWS (now in production on limited branches). Cross compile Mac builds on Linux. Todo: add bug references. Still, Mac and Windows tests are a problem because we need to have the in-house capacity for peak load, which is expensive, or not buy for peak load and deal with a number of pending jobs. Picture "DB-Laaeks25804366678-7". Licensed under CC BY-SA 3.0 via Commons - https://commons.wikimedia.org/wiki/File:DB-Laaeks25804366678-7.JPG#/media/File:DB-
  12. 4. SELF-INFLICTED DENIAL OF SERVICE Engineering for the long game: Managing complexity in distributed systems. Astrid Atkinson https://www.youtube.com/watch?v=p0jGmgIrf_M 18:47 “Google itself is the biggest source of denial of service attacks” At Mozilla, we are no different. We retry jobs when they fail for infrastructure reasons. Which is okay, because perhaps there is a machine which is wonky and needs to be removed from the pool, and the next time the job runs it will run on a machine that is in a clean state. Human error: what has changed? Permissions, network and DNS. Automatic tests of all commits applied to our production infrastructure. Example: IT redirected a server name to a new host where we didn’t have ssh keys deployed. We understood the change as a redirect to a new cname, not a new host. Jobs spiked because they retried trying to fetch this zip. Solution: better monitoring of retries, communication, check for retry spikes vs regular jobs and alert. Picture http://www.flickr.com/photos/ervins_strauhmanis/9554405492/sizes/l/
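The “check for retry spikes vs regular jobs and alert” idea can be sketched like this (the job records and thresholds are invented for illustration): one wonky machine retrying is normal, but retries dominating recent load signals a self-inflicted denial of service.

```python
def retry_alert(recent_jobs, ratio_threshold=0.2, min_jobs=50):
    """Return an alert string when infrastructure retries are an outsized
    share of recent load, or None when all is well."""
    total = len(recent_jobs)
    if total < min_jobs:  # too little data to call it a spike
        return None
    retries = sum(1 for job in recent_jobs if job.get("retried"))
    ratio = retries / total
    if ratio > ratio_threshold:
        return "retry spike: %d/%d jobs retried (%.0f%%)" % (retries, total, ratio * 100)
    return None
```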
  13. 5. TOO MUCH TEST LOAD We run builds on commit. There is some coalescing on branches. Since we build and test for many platforms, we can run up to 500 jobs (all platforms excluding talos) on a single commit. Too many jobs for our infrastructure to handle - high pending counts. How can we intelligently shed load? Solution: Do we really need to run every test on every commit, given that many of them don’t historically reveal problems? We have a project called SETA which analyzes historical test data, and we have implemented changes to our scheduler to accommodate this. Basically, we can reduce the frequency of specified test runs on a per platform, per branch basis. This allows us to shed test load and increase the throughput of the system. Picture: https://www.flickr.com/photos/encouragement/14759554777/
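The scheduler change that SETA feeds reduces to a very small decision (a sketch with made-up names, not the real SETA data format): suites flagged as historically low-value on a given platform and branch run only every Nth push.

```python
def should_run(suite, platform, branch, push_count, low_value, every_n=7):
    """SETA-style load shedding: a (suite, platform, branch) flagged as
    historically low-value runs only on every `every_n`th push; everything
    else runs on every push."""
    if (suite, platform, branch) not in low_value:
        return True
    return push_count % every_n == 0
```

The regression-catching suites still run everywhere; only the suites that historically reveal nothing are throttled, which is what makes this a safe way to shed load.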
  14. 6. SYSTEM RESOURCE LEAKS I often think that managing a large scale distributed system is like being a water or sanitation engineer for a city. Ironically, LinkedIn thinks so too and advertises these jobs to me. Where are the leaks happening? Where they would look for wasted water, we look for wasted computing resources. We recently had a problem where our Windows test pending counts spiked quite drastically. We initially thought this was just due to more test jobs being added to the platform in parallel while the old corresponding tests were not disabled. In fact, the major root cause of the problem was that the additional tests caused additional overhead on the server responsible for managing jobs on the test machines. Basically, the time between a test machine finishing a task and the server responding was getting very long, which, multiplied by hundreds of machines and thousands of jobs, leads to a long backlog. Solution: This issue was resolved by adding additional servers to the pool that services these test machines, upgrading the RAM on each of them, and increasing the interval at which they are restarted. We run many tests in parallel to reduce the end to end time that a build and associated tests take. It is hard to balance startup time against time spent running actual tests: running more tests in parallel means more startup overhead. Solution: Implement more monitoring each time a system resource leak causes an outage. Test chunking: chunking by run time. Picture by Korona Lacasse - Creative Commons Attribution 2.0 Generic https://www.flickr.com/photos/korona4reel/14107877324/sizes/l
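Chunking by run time, mentioned at the end, is a classic load-balancing problem. A greedy sketch (invented data shapes, not our harness code): assign the longest tests first, each to the currently lightest chunk.

```python
import heapq

def chunk_by_runtime(tests, n_chunks):
    """Split (name, seconds) pairs into `n_chunks` groups with roughly
    equal total run time, using longest-processing-time-first greedy
    assignment onto a min-heap of chunk totals."""
    heap = [(0.0, i, []) for i in range(n_chunks)]
    heapq.heapify(heap)
    for name, seconds in sorted(tests, key=lambda t: -t[1]):
        total, i, members = heapq.heappop(heap)  # lightest chunk so far
        members.append(name)
        heapq.heappush(heap, (total + seconds, i, members))
    return [members for _, _, members in sorted(heap, key=lambda c: c[1])]
```

Balanced chunks mean no single chunk dominates the end-to-end time, so the per-chunk startup cost buys the maximum parallel speedup.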
  15. Another system resource leak was plugged by the use of a tool called runner. Runner is a project that manages starting tasks in a defined order: https://github.com/mozilla/build-runner. Basically, it ensures that the machines are in a sane state to run a job. If they are in a good state, we don’t reboot them as often, which increases the overall throughput of the system.
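Runner's core idea, running setup tasks in a defined order and refusing to take work if one fails, fits in a few lines (a sketch, not build-runner's actual implementation):

```python
def run_tasks(tasks):
    """Execute (name, callable) pairs in order, runner-style. On the first
    failure, stop and report which task failed so the machine can be
    cleaned up or rebooted instead of taking a job in a bad state."""
    for name, task in tasks:
        try:
            task()
        except Exception as exc:
            return ("reboot", name, str(exc))
    return ("ready", None, None)
```

A machine that comes back "ready" skips the reboot, which is where the throughput win comes from.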
  16. 7. TIME TO RECOVER FROM FAILURE Our CI system is not a bank’s transaction system or a large commerce website. But still, it meets most of the characteristics of a highly available system. We have some issues regarding failover; we still have single points of failure in our system. A few weeks ago, I watched a talk by Paul Hinze of HashiCorp on the primitives of High Availability (42:20, https://speakerdeck.com/phinze/smoke-and-mirrors-the-primitives-of-high-availability). He states that it is inevitable that the system will fail, but the real measure is how quickly we can recover from it. In terms of managing failure, we have managed to decouple some components in the way that we manage our code repositories. If, for instance, bad code lands on mozilla-inbound and causes all the tests to fail, our code sheriffs can close that tree and leave other repositories open. However, we still have many things that are a single point of failure, for instance the hardware in our data centre or the associated network. Given that jobs automatically retry if they fail for infrastructure reasons, this allows us to bring the system up without a lot of intervention. Solution: Distributed failure - branching model and closing trees. Picture by Mike Green - Creative Commons Attribution-NonCommercial 2.0 Generic https://www.flickr.com/photos/30751204@N06/7328288188/sizes/l
  17. 8. MONITORING nagios, which alerts to irc; papertrail; email alerts; dashboard; treeherder; new relic (doesn’t really apply to releng). Picture is Mystery by ©Stuart Richards, Creative Commons by-nc-sa 2.0 https://www.flickr.com/photos/left-hand/6883500405/
  18. We have a dedicated channel for alerts with the state of our build farm. Nagios alerts send a message to the #buildduty channel. Due to the large number of devices in our build farm, most of these alerts are threshold alerts. We don’t care if a single machine goes down; it will be automatically rebooted, and if this doesn’t work, a bug will be opened for a person to look at it. However, we do care if 200 of them suddenly stop taking jobs. You can see from this page that we have threshold alerts for the number of pending jobs. If this is a sustained spike, we need to look at it in further detail. We also have alerts for things like checking that the golden images that we create each night for our Amazon instances are actually completing, and that the process that kills old Amazon instances when their capacity is not needed is indeed running. We use papertrail as a log aggregator that allows us to quickly search our logs for issues.
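The threshold alerts described above reduce to a per-pool comparison (illustrative pool names and limits, not our actual Nagios configuration):

```python
def pending_alerts(pending_by_pool, thresholds):
    """Threshold alerting sketch: a single dead machine is ignored; we
    only alert when a whole pool's pending count exceeds its limit."""
    return ["%s: %d pending > %d" % (pool, count, thresholds[pool])
            for pool, count in sorted(pending_by_pool.items())
            if pool in thresholds and count > thresholds[pool]]
```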
  19. We use graphite for analytics that allow us to look at long term trends and spikes. For instance, this graph looks at our overall infrastructure time: green is EC2, blue is in-house hardware. Problem: All of this data is sometimes overwhelming. Every time we have an outage around something we haven’t monitored previously, we add another alert or additional monitoring. I don’t really know the solution for dealing with the flood of data, other than to alert only on important things and aggregate alerts by machine class.
  20. 9. DUPLICATE JOBS Duplicate bits, wasted time and resources. We currently build a release twice - once in CI, once as a release job. This is inefficient and makes releng a bottleneck to getting a release out the door. To fix this - implement release promotion! https://bugzilla.mozilla.org/show_bug.cgi?id=1118794. The same thing applies to nightly builds. Picture Creative Commons https://upload.wikimedia.org/wikipedia/commons/c/c6/DNA_double_helix_45.PNG By Jerome Walker, Dennis Myts (Own work) [Public domain], via Wikimedia Commons
  21. 10. SCHEDULING Adding a new platform or new suites of tests currently requires release engineering intervention. We want to make this more self-serve, and allow developers to add new tests and platforms. Solution: We are currently in the process of migrating to a new system that manages task queuing, scheduling, execution and provisioning of resources for our CI system. This system is called taskcluster. It will allow developers to schedule new tests in tree, and will make standing up new platforms much easier. It’s a microservices architecture; jobs run in docker images, which lets developers have the same environment on their desktop as the CI system runs. http://docs.taskcluster.net/ Picture by hehaden - Creative Commons Attribution-NonCommercial 2.0 Generic (CC BY-NC 2.0) https://www.flickr.com/photos/hellie55/5083003751/sizes/l
  22. CONCLUSION • Caring for a large distributed system is like taking care of a city’s water and sewage. • Increase throughput by constraining inputs based on the outputs you want to optimize. • You may need to migrate to something new while keeping the existing system working. You need to identify system leaks and implement monitoring. For instance, in our case, we reduce test jobs to optimize end to end time. Buildbot -> taskcluster.
  23. FURTHER READING • All Your Base 2014: Laura Thomson, Director of Engineering, Cloud Services Engineering and Operations, Mozilla - Many moving parts: monitoring complex systems: https://vimeo.com/album/3108317/video/110088288 • Velocity Santa Clara 2015: Astrid Atkinson, Director of Software Engineering, Google - Engineering for the long game: Managing complexity in distributed systems: https://www.youtube.com/watch?v=p0jGmgIrf_M • Mountain West Ruby Conference: Paul Hinze, HashiCorp - Primitives of High Availability: https://speakerdeck.com/phinze/smoke-and-mirrors-the-primitives-of-high-availability • Strange Loop 2015: Camille Fournier - Hopelessness and Confidence in Distributed Systems Design: http://www.slideshare.net/CamilleFournier1/hopelessness-and-confidence-in-distributed-systems-design
