Building Cloud Tools for Netflix with Jenkins


Brian Moyles and Gareth Bowles from Netflix describe the continuous integration system that lets us build and deploy the Netflix streaming service fast and at scale.

  • Abstract: Over the last couple of years, Netflix's streaming service has become almost completely cloud-based, using Amazon's AWS. This talk will delve into our build and deployment architecture, detailing the evolution of the continuous integration systems that helped prepare us for the cloud move.
  • We work on the Engineering Tools team at Netflix. Both of us came a long way to be here.

    Our team is all about creating tools and systems for our engineers to use to build, test and deploy their apps to the cloud (and the DC if they reaaaaally have to :)).

    I'll give an overview of our continuous integration system and how Jenkins fits into it, then Brian will talk about how we've extended Jenkins and some of the challenges we've found running it at such a large scale.
  • To get to the cloud, we rearchitected the Netflix streaming service into many individual modules implemented as web services, usually web applications or shared libraries (jars).

    Our team was responsible for creating a set of easy-to-use tools to simplify and automate the build of the applications and shared libraries.

    We were also responsible for building the base machine image, creating the architecture for automating the assembly (aka baking - nothing to do with Qwikster!) of the individual application images, and building the web-based tool that is used to deploy and manage the application clusters - but we'll concentrate on our build process for this talk.

    Note that a key aspect of using so many shared services is that each service team has to rebuild often in order to pick up changes from the other services they depend on. This is the CONTINUOUS part of continuous integration, and it is where Jenkins comes in.
  • Here are a few details on how we build all those cloud services.
  • We wrote a Common Build Framework, based on Ant with some custom Groovy scripting, that's used by all our development teams to build different kinds of libraries and apps.

    For the continuous integration to run all those builds, we picked Jenkins because it's very feature-rich, easy to extend, and has a very active community.

    We use Perforce for our version control system, as it's arguably the best centralized VCS available. But we're making increasing use of Git; for example, our many open-sourced projects are all hosted on GitHub, and we use Jenkins to build them.

    We publish library JARs and application WAR files to the Artifactory binary repository tool. This gives us access to the build metadata and allows us to add Ivy to Ant to abstract the build and runtime jars into a dynamic dependency graph, so each project only has to know about its immediate dependencies.

    Unlike many shops, we don't use Jenkins plugins to do build tasks such as publishing to Artifactory; these are implemented in our common build framework to give us finer-grained control over functionality without having to patch a bunch of plugins.
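    The payoff of that dependency graph is that each project declares only its immediate dependencies, and the transitive closure is resolved at build time. A minimal sketch of what that resolution gives us, with hypothetical module names (Python purely for illustration; the real resolution is done by Ivy):

    ```python
    from collections import deque

    # Hypothetical graph: each module lists only its immediate dependencies.
    deps = {
        "helloworld": ["resourceregistry", "platform"],
        "resourceregistry": ["platform"],
        "platform": [],
    }

    def resolve(module):
        """Breadth-first walk returning every jar the build ultimately needs,
        even though the module only declared its direct dependencies."""
        needed, queue = set(), deque([module])
        while queue:
            for dep in deps[queue.popleft()]:
                if dep not in needed:
                    needed.add(dep)
                    queue.append(dep)
        return needed

    print(sorted(resolve("helloworld")))  # → ['platform', 'resourceregistry']
    ```

    The point is that when `platform` changes, every module whose transitive closure includes it needs a rebuild - which is exactly what triggers the continuous part of continuous integration.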
  • Here is all you need to do in Jenkins to set up a typical project's build job. You just tell Jenkins where to find the source code and add in the Common Build Framework, then specify what targets to call from your Ant build file.
  • And here is most of a typical project's Ant and Ivy files. You can see the Ant code simply pulls in one of the standard framework entry points like library, webapplication, etc.

    Then the Ivy file specifies what needs to get built and what the dependencies are. We have some extra Groovy code added to our Ant scripts that can drive Ant targets based on the Ivy artifact definitions. This helps make the build definition declarative and yet flexible.

    Yes, XML makes your eyes bleed, and there is a lot of redundancy here. But at least it's small and manageable.
  • Let’s take a closer look at how we use Jenkins as the core of our build infrastructure, plus a few other interesting uses we’ve come up with.
  • *** The other 50% of jobs are run manually or on a fixed schedule. ***
  • Our Jenkins master runs on a physical server in our data center. The master provides the UI for defining build jobs, plus controlling and monitoring their execution.

    Slave servers are used to execute the actual builds. Our standard slaves can each run 4 simultaneous builds. Custom slave groups are set up for requirements such as C/C++ builds or jobs with high CPU or memory needs.

    We vary the number of slaves from 15 to 30 depending on demand. This is currently a manual operation, but we're working on autoscaling.

    Our cloud slaves are set up in an AWS Virtual Private Cloud (VPC), which provides common network access between our data center and AWS. Amazon's us-west-1 region is physically located close to our data center, so latency is not an issue.

    Ad-hoc slaves in our DC or office are used by individual teams if they need an O/S variant other than those on our standard slaves, or a specific tool or licensed app.

    We keep our standard slaves updated by maintaining a common set of tools (JDKs, Ant, Groovy, etc.) on the master and syncing the tools to the slaves when they are restarted. Custom slaves can also use this mechanism if they choose.
  • At its heart Jenkins is just a really nice job scheduler, so we've found lots of other uses for it. Here are some of the main ones; in the interest of time I'm not going to describe each one in detail, but please hit us up with questions if you're interested.

    Housekeeping jobs usually use system Groovy scripts for access to the Jenkins runtime. We're looking at posting some of these to the public scripts repository.

    Now I'll hand it over to Brian, who is going to talk about some scalability challenges and how we're addressing them.
  • We've run into a number of scaling challenges as we've evolved our build pipeline: thundering herd problems, modifying and managing 1600 jobs, making sure those 1600 jobs work from Jenkins version to Jenkins version, plugin version to plugin version, and so on.

    Our goal, of course, is to have one-button build/test/deploy with as little human intervention as possible, and to make the developer's life as pain-free as we can. All of these get in our way.
  • We've enhanced Jenkins with a few plugins and odd jobs:
    - We're working on a job DSL that will allow us to create job templates and simplify the process of configuring new jobs
    - We've got a number of housekeeping and maintenance jobs running via Jenkins and system Groovy scripts, doing everything from disabling builds that consistently fail for a long period with no intervention (abandoned jobs) to enforcing consistency in job configuration
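    The abandoned-job check is essentially a policy over each job's build history. A sketch of what such a policy could look like (Python purely for illustration; the real check is a system Groovy script against the Jenkins runtime, and the 30-day grace period here is an assumed value, not necessarily the one we use):

    ```python
    from datetime import datetime, timedelta

    GRACE = timedelta(days=30)  # assumed threshold for "a long period of time"

    def is_abandoned(now, last_success, last_failure):
        """A job is a candidate for auto-disabling if it is still failing
        and has had no successful build inside the grace window."""
        if last_failure is None:
            return False                        # never failed: leave it alone
        if last_success is None:
            return now - last_failure > GRACE   # only ever failed, long untouched
        return last_success < last_failure and now - last_success > GRACE
    ```

    Running this over all 1600 job definitions on a schedule keeps dead jobs from occupying executors during a thundering herd.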
  • And we created the DynaSlave plugin, our cloud-based army of build nodes, to directly address one of our scalability problems: executor exhaustion and deep build queues during thundering herds/build storms.
  • When we started the project, our build node fleet was a set of virtual machines in our datacenter.

    As I mentioned, when we change the build framework, everything tries to rebuild (which sounds crazy but is a good thing - *continuous integration*. The sooner we can find a problem, the sooner we can fix it).

    We could've bounded our changes and restricted them to off hours, but at Netflix there isn't really such a thing as off hours and you're bound to get in someone's way! We could've deployed more VMs, but that involves other teams and leaves us with excess capacity and wasted resources during lulls...
  • Plus we had this great platform built on top of AWS and EC2. Why not leverage that?

    We get to take advantage of our tooling and our experience with the service, we can add and remove capacity on demand, and maybe even make Jenkins master of its own domain and let it control the build node population directly.

    At the time we started building this (mid-2011), nothing plugin-wise that we could find would maintain a small fixed fleet of AWS resources for us. Plugins seemed to take aim at using EC2 for nothing but spikes in demand, whereas we wanted to forklift the whole fleet into the cloud.
  • We put together a plugin that accomplishes some of those goals. The DynaSlave plugin currently allows an EC2 node to launch and register itself with Jenkins, totally hands-free. Each slave can tell Jenkins details about what it wants to be, what it can build, and so on. We can tailor nodes to specific needs and create custom pools of nodes with different instance sizes. The plugin, today, has no idea these nodes are even in EC2 - pool sizing is managed by AWS ASGs and our cloud management tools like Asgard, our Amazon management console (soon to be open sourced!).
  • We're not done, though. We have a number of enhancements in the pipeline, but one of the bigger bits is dynamic resource management.

    We're still doing some things manually, like controlling the pool size. If someone wants to make a change to our framework, they have to remember to scale the pool up (but not too big, as that can kill other systems by proxy), and they have to remember to scale down after the event. That is EXTRA tedious, as resizing ASGs will swat away nodes that are still executing jobs.

    Jenkins knows what the queue looks like and how many slaves are doing work, so we want to make the plugin intelligent enough to manage its own pools. When it scales a pool down, Jenkins can pause nodes that are idle and make sure those are the ones pulled by the ASG, as well as bleed off traffic from busy nodes that need to be reaped.

    We're planning on giving this back, so keep an eye on our blog for announcements to that effect.
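    The pool-sizing logic we want boils down to: provide enough executors for the running plus queued builds, clamped to sane bounds. A sketch using the 4-executors-per-slave and 15-30 node figures mentioned earlier (Python purely for illustration; the real plugin would drive an ASG's desired capacity, and the function name is hypothetical):

    ```python
    def desired_pool_size(queue_length, busy_executors,
                          per_node_executors=4, min_nodes=15, max_nodes=30):
        """Compute an ASG size from Jenkins' own view of demand:
        one executor per running or queued build, clamped to bounds."""
        executors_needed = busy_executors + queue_length
        nodes = -(-executors_needed // per_node_executors)  # ceiling division
        return max(min_nodes, min(max_nodes, nodes))
    ```

    With something like this evaluated periodically, nobody has to remember to scale up before a framework change or back down afterwards; the clamp keeps a herd from hammering downstream systems like Artifactory.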
  • Here are some places to look for more info.

    Adrian's presentations on SlideShare are a great resource if you want to know more about our cloud architecture in general.

    We're hiring!

    1. Building and Deploying Netflix in the Cloud @bmoyles @garethbowles #netflixcloud
    2. Who Are These Guys? Brian Moyles, Gareth Bowles
    3. What We Build
       Large number of loosely-coupled Java Web Services
       Common code in libraries that can be shared across apps
       Each service is “baked” - installed onto a base Amazon Machine Image and then created as a new AMI ...
       ... and then deployed into a Service Cluster (a set of Auto Scaling Groups running a particular service)
    4. Getting Built
    5. Build Pipeline (diagram: source from Perforce and GitHub flows through the Jenkins CBF steps - sync, check, resolve, compile, build, test, publish, report - with libraries and yum packages published to Artifactory)
    6. build.xml:
       <project name="helloworld">
         <import file="../../../Tools/build/webapplication.xml"/>
       </project>
       ivy.xml:
       <info organisation="netflix" module="helloworld">
         <publications>
           <artifact name="helloworld" type="package" e:classifier="package" ext="tgz"/>
           <artifact name="helloworld" type="javadoc" e:classifier="javadoc" ext="jar"/>
         </publications>
         <dependencies>
           <dependency org="netflix" name="resourceregistry" rev="latest.${input.status}" conf="compile"/>
           <dependency org="netflix" name="platform" rev="latest.${input.status}" conf="compile"/>
           ...
    7. Jenkins at Netflix
    8. Jenkins Statistics
       1600 job definitions, 50% SCM triggered
       2000 builds per day
       Common Build Framework updates trigger 800 rebuilds; by scaling up to 20 cloud slaves we can complete the flood of new builds in 30 minutes
       2TB of build data
    9. Jenkins Architecture (diagram: a single master in the Netflix data center - Red Hat Linux, 2x quad-core x86_64, 26G RAM - driving a standard x86_64 slave group in the data center; custom slave groups for misc. architectures; ad-hoc slaves with misc. O/S and architectures in the data center and office; ~40 custom slaves maintained by product teams; and cloud slaves on Amazon Linux m1.xlarge instances in a us-west-1 VPC)
    10. Other Uses of Jenkins
        Monitoring of our test and production Cassandra clusters
        Automated integration tests, including bake and deploy
        Production bake and deployment
        Housekeeping of the build / deploy infrastructure:
          Reap unreferenced artifacts in Artifactory
          Disable Jenkins jobs with no recent successful builds
          Mark Jenkins builds as permanent if they are used by an active deployment in prod or test
          Alert owners when slaves get disconnected
    11. Jenkins Scaling Challenges
        Flood of simultaneous builds can quickly exhaust all build executors and clog the pipeline
        Flood of simultaneous builds can hammer the rest of the infrastructure (especially Artifactory)
        Making global changes to all jobs
        Some plugins don’t scale to our number of jobs / builds
        Hard to test every job before upgrading master or plugins
        Large amount of state encapsulated in build data makes restoring from backup time consuming
    12. Netflix Extensions to Jenkins
        Job DSL plugin: allow jobs to be set up with minimal definition, using templates and a Groovy-based DSL
        Housekeeping and maintenance processes implemented as Jenkins jobs and system Groovy scripts
    13. The DynaSlave Plugin: our cloud-based army of build nodes
    14. The DynaSlave Plugin: Genesis
        Original build fleet: 15 VMs on datacenter hardware, 8G RAM, single vCPU, 2 executors per node
        Many jobs build on SCM change. Changes to our common build framework create a massive thundering herd, since everything depends on it.
        Ask for more VMs? Modify CBF less frequently?
    15. The DynaSlave Plugin: What We Wanted
        Leverage our extensive AWS infrastructure, tooling, and experience
        No manual fiddling with machines once they launch
        Quick and easy to maintain a fixed pool of slave nodes that can grow/shrink to meet build demand
    16. The DynaSlave Plugin: What We Have
        Exposes a new endpoint in Jenkins that EC2 instances in the VPC use for registration
        Allows a slave to name itself, label itself, and tell Jenkins how many executors it can support
        EC2 == ephemeral: disconnected nodes that are gone for > 30 mins are reaped
        Sizing handled by EC2 ASGs, with tweaks passed through via user data (labels, names, etc.)
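    The 30-minute reaping rule above can be sketched as a simple filter over node state (Python purely for illustration; the real plugin works against the Jenkins node API, and the data shape here is hypothetical):

    ```python
    from datetime import datetime, timedelta

    REAP_AFTER = timedelta(minutes=30)  # from the slide: gone for > 30 mins

    def nodes_to_reap(nodes, now):
        """Given (name, disconnected_since) pairs, where None means the node
        is still connected, return the ephemeral nodes to remove from Jenkins."""
        return [name for name, since in nodes
                if since is not None and now - since > REAP_AFTER]
    ```

    Because EC2 instances are ephemeral, a node that stays disconnected this long is almost certainly terminated, so dropping its Jenkins registration keeps the node list in sync with the ASG.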
    17. The DynaSlave Plugin: What’s Next
        Dynamic resource management: have Jenkins respond to build demand and manage its own slave pools
        Slave groups: allows us to create specialized pools of build nodes (isolated from the genpop)
        Refresh mechanism for slave tools (JDKs, Ant versions, etc.)
        Enhanced security/registration of nodes
        Give it back to the community (!)
    18. Further Reading @netflixoss
    19. Thank you @bmoyles @garethbowles
    20. Thank you Questions? @bmoyles @garethbowles