Brian Moyles and Gareth Bowles from Netflix describe the continuous integration system that lets Netflix build and deploy its streaming service fast and at scale.
3. What We Build
Large number of loosely-coupled Java web services
Common code in libraries that can be shared across apps
Each service is “baked”: installed onto a base Amazon Machine Image and then created as a new AMI ...
... and then deployed into a Service Cluster (a set of Auto Scaling Groups running a particular service)
9. Jenkins Statistics
1600 job definitions, 50% SCM triggered
2000 builds per day
Common Build Framework updates trigger 800 rebuilds; by scaling up to 20 cloud slaves we can complete the flood of new builds in 30 minutes
2TB of build data
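As a back-of-the-envelope check on the rebuild-flood numbers above (the 4 executors per slave figure comes from the architecture notes later in this talk; the ~3 minute average build time is an illustrative assumption chosen to match the observed total):

```python
# Rough sketch of the rebuild-flood arithmetic; the average build time
# is an assumption, not a number given in the talk.
rebuilds = 800
slaves = 20
executors_per_slave = 4            # standard slaves run 4 simultaneous builds

concurrent_builds = slaves * executors_per_slave   # 80 builds at a time
waves = rebuilds / concurrent_builds               # 10 successive waves
avg_build_minutes = 3.0                            # assumed
total_minutes = waves * avg_build_minutes
print(total_minutes)  # → 30.0
```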
10. Jenkins Architecture
[Architecture diagram: a single Jenkins master (Red Hat Linux, 2x quad-core x86_64, 26 GB RAM) in the Netflix data center; standard x86_64 slave groups (Amazon Linux, m1.xlarge) in the us-west-1 VPC; ~40 custom slave groups maintained by product teams; and ad-hoc slaves running misc. O/S and architectures in the Netflix data centers and office.]
11. Other Uses of Jenkins
Monitoring of our test and production Cassandra clusters
Automated integration tests, including bake and deploy
Production bake and deployment
Housekeeping of the build / deploy infrastructure:
Reap unreferenced artifacts in Artifactory
Disable Jenkins jobs with no recent successful builds
Mark Jenkins builds as permanent if they are used by an active deployment in prod or test
Alert owners when slaves get disconnected
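The housekeeping rules above are simple policies over Jenkins metadata. As an illustrative sketch (in Python, not the actual system Groovy scripts, and with an assumed staleness threshold), the "disable jobs with no recent successful builds" rule might look like:

```python
# Sketch of the "disable stale jobs" housekeeping rule. The 90-day cutoff
# is an assumption; the real threshold isn't given in the talk.
from datetime import datetime, timedelta

STALE_AFTER = timedelta(days=90)

def jobs_to_disable(jobs, now):
    """jobs: iterable of (name, last_success), where last_success is the
    datetime of the most recent successful build, or None if none exists."""
    stale = []
    for name, last_success in jobs:
        if last_success is None or now - last_success > STALE_AFTER:
            stale.append(name)
    return stale

now = datetime(2012, 6, 1)
jobs = [
    ("api-build", datetime(2012, 5, 30)),    # recently green: keep
    ("legacy-tool", datetime(2011, 11, 1)),  # stale: disable
    ("never-green", None),                   # never succeeded: disable
]
print(jobs_to_disable(jobs, now))  # → ['legacy-tool', 'never-green']
```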
12. Jenkins Scaling Challenges
Flood of simultaneous builds can quickly exhaust all build executors and clog the pipeline
Flood of simultaneous builds can hammer the rest of the infrastructure (especially Artifactory)
Making global changes to all jobs
Some plugins don’t scale to our number of jobs / builds
Hard to test every job before upgrading the master or plugins
Large amount of state encapsulated in build data makes restoring from backup time-consuming
13. Netflix Extensions to Jenkins
Job DSL plugin: allows jobs to be set up with minimal definition, using templates and a Groovy-based DSL
Housekeeping and maintenance processes implemented as Jenkins jobs and system Groovy scripts
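The actual Job DSL plugin uses a Groovy-based DSL; as a rough Python sketch of the underlying template idea (the field names and SCM path scheme here are hypothetical), a minimal job definition expands against a shared template into a full job configuration:

```python
# Hypothetical sketch of "minimal definition + template" job setup.
# The real plugin is Groovy; these keys are illustrative only.
TEMPLATE = {
    "scm": "perforce://depot/{name}",   # hypothetical SCM path scheme
    "build_framework": "CBF",           # the Common Build Framework
    "targets": ["clean", "build"],
    "slave_label": "standard",
}

def expand_job(minimal):
    """The minimal definition supplies only the project name and any
    overrides; everything else comes from the shared template."""
    job = dict(TEMPLATE)
    job.update(minimal)
    job["scm"] = job["scm"].format(name=minimal["name"])
    return job

job = expand_job({"name": "api-server", "targets": ["clean", "test", "build"]})
print(job["scm"])      # → perforce://depot/api-server
print(job["targets"])  # → ['clean', 'test', 'build']
```

The point of the template is consistency: a one-line change to the template propagates to every generated job, instead of hand-editing 1600 job definitions.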
15. The DynaSlave Plugin
Genesis
Original build fleet: 15 VMs on datacenter hardware, 8G RAM, single vCPU, 2 executors per node
Many jobs build on SCM change. Changes to our common build framework create a massive thundering herd, since everything depends on it.
Ask for more VMs? Modify the CBF less frequently?
16. The DynaSlave Plugin
What We Wanted
Leverage our extensive AWS infrastructure, tooling, and experience
No manual fiddling with machines once they launch
Quick and easy to maintain a fixed pool of slave nodes that can grow/shrink to meet build demand
17. The DynaSlave Plugin
What We Have
Exposes a new endpoint in Jenkins that EC2 instances in the VPC use for registration
Allows a slave to name itself, label itself, and tell Jenkins how many executors it can support
EC2 == ephemeral: disconnected nodes that are gone for > 30 mins are reaped
Sizing handled by EC2 ASGs; tweaks passed through via user data (labels, names, etc.)
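The registration handshake above can be sketched as follows: an instance reads the tweaks passed through its user data and reports its name, labels, and executor count to the new Jenkins endpoint. This is an illustrative Python sketch; the key names and user-data encoding are assumptions, not the plugin's actual wire format.

```python
# Hypothetical sketch of a DynaSlave self-registration payload built from
# EC2 user data. Key names and the 'key=value&...' encoding are assumed.
def build_registration(user_data, instance_hostname):
    """user_data: 'key=value' pairs passed through the ASG launch config."""
    params = dict(pair.split("=", 1) for pair in user_data.split("&"))
    return {
        "name": params.get("slave_name", instance_hostname),
        "labels": params.get("labels", "").split(","),
        "executors": int(params.get("executors", "1")),
    }

payload = build_registration(
    "slave_name=dynaslave-01&labels=standard,x86_64&executors=4",
    "ip-10-0-0-1")
print(payload)
# → {'name': 'dynaslave-01', 'labels': ['standard', 'x86_64'], 'executors': 4}
```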
18. The DynaSlave Plugin
What’s Next
Dynamic resource management: have Jenkins respond to build demand and manage its own slave pools
Slave groups: allows us to create specialized pools of build nodes (isolated from the genpop)
Refresh mechanism for slave tools (JDKs, Ant versions, etc.)
Enhanced security/registration of nodes
Give it back to the community (watch techblog.netflix.com!)
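The dynamic-resource-management idea reduces to a feedback rule: grow the pool when the build queue backs up, shrink it gently when executors sit idle. A minimal sketch, using the 15–30 node range and 4 executors per slave mentioned elsewhere in the talk (the idle threshold is an assumption):

```python
# Sketch of a pool-sizing decision driven by Jenkins queue state.
# Bounds come from the talk; the idle-shrink threshold is assumed.
def desired_pool_size(current, queued_builds, idle_executors,
                      executors_per_slave=4, min_size=15, max_size=30):
    if queued_builds > 0:
        # Add enough slaves to absorb the queue, capped at the pool maximum.
        needed = (queued_builds + executors_per_slave - 1) // executors_per_slave
        return min(current + needed, max_size)
    if idle_executors >= 2 * executors_per_slave:
        return max(current - 1, min_size)   # shrink gently during lulls
    return current

print(desired_pool_size(20, queued_builds=40, idle_executors=0))  # → 30
print(desired_pool_size(20, queued_builds=0, idle_executors=8))   # → 19
```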
Abstract: Over the last couple of years Netflix's streaming service has become almost completely cloud-based, using Amazon's AWS. This talk will delve into our build and deployment architecture, detailing the evolution of the continuous integration systems that helped prepare us for the cloud move.
We work on the Engineering Tools team at Netflix. Both of us came a long way to be here.

Our team is all about creating tools and systems for our engineers to use to build, test, and deploy their apps to the cloud (and the DC if they reaaaaally have to :)).

I’ll give an overview of our continuous integration system and how Jenkins fits into it, then Brian will talk about how we’ve extended Jenkins and some of the challenges we’ve found running it at such a large scale.
To get to the cloud, we rearchitected the Netflix streaming service into many individual modules implemented as web services, usually web applications or shared libraries (jars).

Our team was responsible for creating a set of easy-to-use tools to simplify and automate the builds of those applications and shared libraries. We were also responsible for building the base machine image, creating the architecture for automating the assembly (aka baking, nothing to do with Qwikster!) of the individual application images, and building the web-based tool used to deploy and manage the application clusters. But we’ll concentrate on our build process for this talk.

Note that a key aspect of using so many shared services is that each service team has to rebuild often in order to pick up changes from the other services they depend on. This is the CONTINUOUS part of continuous integration, and it’s where Jenkins comes in.
Here are a few details on how we build all those cloud services.
We wrote a Common Build Framework, based on Ant with some custom Groovy scripting, that’s used by all our development teams to build different kinds of libraries and apps.

For the continuous integration server to run all those builds, we picked Jenkins because it’s very feature-rich, easy to extend, and has a very active community.

We use Perforce for our version control system, as it’s arguably the best centralized VCS available. But we’re making increasing use of Git; for example, our many open-sourced projects are all hosted on GitHub, and we use Jenkins to build them.

We publish library JARs and application WAR files to the Artifactory binary repository tool. This gives us access to the build metadata and lets us add Ivy to Ant to abstract the build and runtime jars into a dynamic dependency graph, so each project only has to know about its immediate dependencies.

Unlike many shops, we don’t use Jenkins plugins to do build tasks such as publishing to Artifactory; these are implemented in our Common Build Framework to give us finer-grained control over functionality without having to patch a bunch of plugins.
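The reason each project only needs its immediate dependencies is that the dependency manager (Ivy, in our setup) walks the graph transitively at build time. A minimal sketch of that walk, with hypothetical project names:

```python
# Sketch of transitive dependency resolution over declared *immediate*
# dependencies. Project names are hypothetical examples.
def transitive_deps(project, declared, seen=None):
    """declared: project -> list of its immediate dependencies."""
    if seen is None:
        seen = set()
    for dep in declared.get(project, []):
        if dep not in seen:
            seen.add(dep)
            transitive_deps(dep, declared, seen)  # recurse into deps of deps
    return seen

declared = {
    "videometadata": ["platform", "netflix-commons"],
    "platform": ["netflix-commons"],
    "netflix-commons": [],
}
# videometadata declares only two deps, but resolution pulls the full closure.
print(sorted(transitive_deps("videometadata", declared)))
# → ['netflix-commons', 'platform']
```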
Here is all you need to do in Jenkins to set up a typical project’s build job. You just tell Jenkins where to find the source code, add in the Common Build Framework, then specify what targets to call from your Ant build file.
And here is most of a typical project’s Ant and Ivy files. You can see the Ant code simply pulls in one of the standard framework entry points, like library, webapplication, etc.

The Ivy file then specifies what needs to get built and what the dependencies are. We have some extra Groovy code added to our Ant scripts that can drive Ant targets based on the Ivy artifact definitions. This helps make the build definition declarative yet flexible.

Yes, XML makes your eyes bleed, and there is a lot of redundancy here. But at least it’s small and manageable.
Let’s take a closer look at how we use Jenkins as the core of our build infrastructure, plus a few other interesting uses we’ve come up with.
The other 50% of jobs are triggered manually or run on a fixed schedule.
Our Jenkins master runs on a physical server in our data center. The master provides the UI for defining build jobs, plus controlling and monitoring their execution.

Slave servers are used to execute the actual builds. Our standard slaves can each run 4 simultaneous builds. Custom slave groups are set up for requirements such as C/C++ builds or jobs with high CPU or memory needs.

We vary the number of slaves from 15 to 30 depending on demand. This is currently a manual operation, but we’re working on autoscaling.

Our cloud slaves are set up in an AWS Virtual Private Cloud (VPC), which provides common network access between our data center and AWS. Amazon’s us-west-1 region is physically located close to our data center, so latency is not an issue.

Ad-hoc slaves in our DC or office are used by individual teams if they need an O/S variant other than those on our standard slaves, or a specific tool or licensed app.

We keep our standard slaves updated by maintaining a common set of tools (JDKs, Ant, Groovy, etc.) on the master and syncing the tools to the slaves when they are restarted. Custom slaves can also use this mechanism if they choose.
At its heart Jenkins is just a really nice job scheduler, so we’ve found lots of other uses for it. Here are some of the main ones; in the interest of time I’m not going to describe each one in detail, but please hit us up with questions if you’re interested.

Housekeeping jobs usually use system Groovy scripts for access to the Jenkins runtime. We’re looking at posting some of these to the public scripts repository.

Now I’ll hand it over to Brian, who is going to talk about some scalability challenges and how we’re addressing them.
We’ve run into a number of scaling challenges as we’ve evolved our build pipeline: thundering herd problems, modifying and managing 1600 jobs, making sure those 1600 jobs work from Jenkins version to Jenkins version, plugin version to plugin version, and so on.

Our goal, of course, is to have one-button build/test/deploy with as little human intervention as possible, and to make the developer’s life as pain-free as we can. All of these get in our way.
We’ve enhanced Jenkins with a few plugins and odd jobs:
- We’re working on a job DSL that will allow us to create job templates and simplify the process of configuring new jobs
- We’ve got a number of housekeeping and maintenance jobs running via Jenkins and system Groovy scripts, doing everything from disabling builds that consistently fail for a long period of time with no intervention (abandoned jobs) to enforcing consistency in job configuration
And we created the DynaSlave plugin, our cloud-based army of build nodes, to directly address one of our scalability problems: executor exhaustion and deep build queues during thundering herds/build storms.
When we started the project, our build node fleet was a set of virtual machines in our datacenter.

As I mentioned, when we change the build framework, everything tries to rebuild, which sounds crazy but is a good thing: *continuous integration*. The sooner we can find a problem, the sooner we can fix it.

We could’ve bounded our changes, restricted them to off hours, but at Netflix there isn’t really such a thing as off hours, and you’re bound to get in someone’s way! We could’ve deployed more VMs, but that involves other teams and leaves us with excess capacity and wasted resources during lulls...
Plus we had this great platform built on top of AWS and EC2. Why not leverage that? We get to take advantage of our tooling and our experience with the service, we can add and remove capacity on demand, and maybe even make Jenkins master of its own domain and let it control the build node population directly.

At the time we started building this (mid-2011), nothing plugin-wise we found could maintain a small fixed fleet of AWS resources for us. Plugins seemed to take aim at using EC2 for nothing but spikes in demand, whereas we wanted to forklift the whole fleet into the cloud.
We put together a plugin that accomplished some of those goals. The DynaSlave plugin currently allows an EC2 node to launch and register itself with Jenkins, totally hands-free. Each slave can tell Jenkins details about what it wants to be, what it can build, and so on. We can tailor nodes to specific needs and create custom pools of nodes with different instance sizes. The plugin, today, has no idea these nodes are even in EC2; pool sizing is managed by AWS ASGs and our cloud management tools like Asgard, our Amazon management console (soon to be open sourced!).
We’re not done, though. We have a number of enhancements in the pipeline, but one of the bigger bits is dynamic resource management.

We’re still doing some things manually, like controlling the pool size. If someone wants to make a change to our framework, they have to remember to scale the pool up, but not too big, as that can kill other systems by proxy. They also have to remember to scale down after the event, which is EXTRA tedious because resizing ASGs will swat away nodes that are still executing jobs.

Jenkins knows what the queue looks like and how many slaves are doing work, so we want to make the plugin intelligent enough to manage its own pools. When it scales a pool down, Jenkins can pause nodes that are idle and make sure those are the ones pulled by the ASG, as well as bleed off traffic from busy nodes that need to be reaped.

We’re planning on giving this back, so keep an eye on our blog at techblog.netflix.com for announcements to that effect.
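The scale-down ordering described above can be sketched simply: when shrinking the pool, remove idle nodes first so the ASG resize never swats away slaves that are still executing jobs. A minimal sketch (not the plugin's actual logic):

```python
# Sketch of idle-first node selection for pool scale-down.
def choose_nodes_to_remove(nodes, count):
    """nodes: list of (name, busy_executors). Idle nodes (0 busy) are
    removed first; among busy nodes, the least-loaded are preferred,
    since they are the easiest to drain before termination."""
    ranked = sorted(nodes, key=lambda n: n[1])
    return [name for name, _ in ranked[:count]]

nodes = [("a", 3), ("b", 0), ("c", 1), ("d", 0)]
print(choose_nodes_to_remove(nodes, 2))  # → ['b', 'd']
```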
Here are some places to look for more info. Adrian’s presentations on Slideshare are a great resource if you want to know more about our cloud architecture in general.

We’re hiring!