How Jenkins Builds the Netflix Global Streaming Service

•Download as KEY, PDF•

28 likes•9,896 views

Gareth Bowles

Jenkins User Conference SF 2012 presentation "How Jenkins Builds the Netflix Global Streaming Service"

Technology

How Jenkins Builds the Netﬂix Global Streaming Service

@bmoyles @garethbowles #netﬂixcloud

The Engineering Tools Team
Make build, test and deploy simple and pain-free
Various skills - developers, build/release engineers,
devops guys
CI infrastructure
Common Build Framework
Bakery
Asgard

Jenkins History
Jenkins (and H***** before it) in use at Netﬂix since
2009
Rapid acceptance and increase in usage, thanks to:
Jenkins’ ﬂexibility and ease of use
Our Common Build Framework
Quantity, size and variety of jobs means we need
automation and extensions

What We Build
Large number of loosely-coupled Java Web Services
Common code in libraries that can be shared across
apps
Each service is “baked” - installed onto a base Amazon
Machine Image and then created as a new AMI ...
... and then deployed into a Service Cluster (a set of
Auto Scaling Groups running a particular service)
So our unit of deployment is the AMI

Build Pipeline ASGs

Artifactory yum Bakery Asgard

Jenkins Jars/WARs RPMs AMIs

sync resolve compile publish bake deploy
Common Build Framework

source

Perforce
GitHub

Ant Setup for a New Build Job
build.xml
<project name="helloworld">
<import file="../../../Tools/build/webapplication.xml"/>
</project>

ivy.xml
<info organisation="netflix" module="helloworld">
<publications>
<artifact name="helloworld" type="package"
e:classifier="package" ext="tgz"/>
<artifact name="helloworld" type="javadoc"
e:classifier="javadoc" ext="jar"/>
</publications>
<dependencies>
<dependency org="netflix" name="resourceregistry"
rev="latest.${input.status}" conf="compile"/>
<dependency org="netflix" name="platform"
rev="latest.${input.status}" conf="compile" />
...

Not Just a Build Server

Monitoring of our Cassandra clusters
Automated integration tests, including bake
and deploy
Housekeeping and monitoring of
the build / deploy infrastructure
our AWS environment

Jenkins Statistics
1600 job deﬁnitions, 50% SCM triggered
2000 builds per day
Common Build Framework changes trigger 800 builds
2TB of build data
Over 100 plugins

Current Jenkins Architecture
(One Master to Build Them All)
x86_64 slave 11
x86_64 slave 1
x86_64 slave
buildnode01 1
x86_64 slave
Standard
buildnode01 custom slaves
buildnode01
buildnode01 custom slaves
custom slaves
slave group misc. architecture
custom slaves
misc. architecture
misc. architecture
custom slaves
Amazon Linux Single Master misc. architecture
m1.xlarge misc. architecture
Ad-hoc slaves
Red Hat Linux
2x quad core x86_64 misc. O/S & architectures
26G RAM

x86_64 slave 11
x86_64Custom
x86_64slave 1
slave
buildnode01
~40 custom slaves
buildnode01
slave group
buildnode01 maintained by product
Amazon Linux teams
various

AWS us-west-1 VPC Netﬂix data center Netﬂix DC and ofﬁce

Scaling Challenges -
and How We Met Them

Jenkins Scaling Challenges
Thundering herds clogging Jenkins, downstream
systems
Making global changes across jobs
Making sure plugins can handle 1600 jobs
Testing new Jenkins and plugin versions
Backups

New Architecture (Cloudy
with a Chance of Builds)
x86_64
x86_64 slave
slave 1
Build custom
slave1
buildnode01
group Build Master
custom
slaves
Amazon Linux Ad-hoc slaves
slaves
CentOS or Ubuntu misc.
m1.xlarge misc. O/S &
misc.
m2.2xlarge
architectures
(4 core / 32GB)

x86_64
x86_64 slave
Custom
slave 1
1
slave groups custom
buildnode01
Amazon Linux custom
Test Master slaves
various Ad-hoc slaves
slaves
CentOS or Ubuntu misc.
misc. O/S &
misc.
m2.2xlarge
architectures
(4 core / 32GB)
x86_64
x86_64 slave
slave 1
Test slave
1
buildnode01
group
Amazon Linux
m1.xlarge
AWS us-west-1 VPC AWS and Netﬂix ofﬁce

Deploying a New Master
EBS
Live Test
EBS
master master

ENI
Standby 1 builds.netﬂix.com
EBS Standby 2
EBS

Active ASG Test ASG
Jenkins Master cluster

Building a New Master
Jenkins build watches http://mirrors.jenkins-ci.org/war/
latest/jenkins.war and downloads the new Jenkins
WAR ﬁle
Download latest versions of our installed plugins (from a
list stored externally)
Create an RPM containing the Jenkins webapp
Bake an AMI with the RPM installed under Netﬂix
Tomcat

Job Conﬁguration Management
Conﬁguration Slicing plugin
Great for the subset of conﬁg options that it handles
Don’t want to extend it for each new option
Job DSL plugin
Set up new jobs with minimal deﬁnition, using
templates and a Groovy-based DSL
Change the template once to update all the jobs that
use it

Old and busted...

The new
hotness...

etc.

More Maintenance
System Groovy scripts to clean up old job conﬁgs and
associated artifacts:
Disable jobs that haven’t built successfully for 60
days
Set # of builds to keep if unset
Mark builds permanent if installed on active instances
Delete unreferenced JARs from Artifactory

Keeping Slaves Busy

System Groovy scripts to monitor slave status and
report on problems
The Dynaslave plugin: slave self-registration

Further Reading
http://techblog.netﬂix.com
http://www.slideshare.net/netﬂix
https://github.com/netﬂix
http://jobs.netﬂix.com
@netﬂixoss

Thank you
Questions?

@bmoyles @garethbowles

What's hot

Jenkins tutorialMamun Rashid, CCDH

Dev ops using JenkinsSynergetics Learning and Cloud Consulting

Jenkins OverviewAhmed M. Gomaa

Kubernetes Architecture | Understanding Kubernetes Components | Kubernetes Tu...Edureka!

Jenkins tutorialHarikaReddy115

What is Jenkins | Jenkins Tutorial for Beginners | EdurekaEdureka!

JenkinsRoger Xia

JenkinsMohanRaviRohitth

Jenkins CI presentationJonathan Holloway

Azure DevOpsFelipe Artur Feltes

Introduction to CI/CDHoang Le

Transforming Organizations with CI/CDCprime

Service Mesh - Why? How? What?Orkhan Gasimov

DevOps without DevOps ToolsJagatveer Singh

Jenkins for java worldAshok Kumar

Continuous integrationamscanne

Introduction to Kafka Cruise ControlJiangjie Qin

(Declarative) Jenkins PipelinesSteffen Gebert

Anatomy of a Continuous Integration and Delivery (CICD) PipelineRobert McDermott

DevOps Monitoring and AlertingKhairul Zebua

What's hot (20)

Jenkins tutorial

Dev ops using Jenkins

Jenkins Overview

Kubernetes Architecture | Understanding Kubernetes Components | Kubernetes Tu...

Jenkins tutorial

What is Jenkins | Jenkins Tutorial for Beginners | Edureka

Jenkins

Jenkins CI presentation

Azure DevOps

Introduction to CI/CD

Transforming Organizations with CI/CD

Service Mesh - Why? How? What?

DevOps without DevOps Tools

Jenkins for java world

Continuous integration

Introduction to Kafka Cruise Control

(Declarative) Jenkins Pipelines

Anatomy of a Continuous Integration and Delivery (CICD) Pipeline

DevOps Monitoring and Alerting

Viewers also liked

Everyday I'm Scaling... Cassandra (Ben Bromhead, Instaclustr) | C* Summit 2016DataStax

Continuous Delivery at NetflixRob Spieldenner

Outside The Box With Apache CassnadraEric Evans

Cassandra: Not Just NoSQL, It's MoSQLEric Evans

C* Summit 2013: Cassandra at Instagram by Rick BransonDataStax Academy

Building Cloud Tools for Netflix with JenkinsGareth Bowles

Netflix API - Separation of ConcernsDaniel Jacobson

NetflixOSS meetup lightning talks and roadmapRuslan Meshenberg

Cassandra at Instagram (August 2013)Rick Branson

Continuous delivery appliedMike McGarr

Cassandra Summit 2014: Cassandra @ Netflix: Building a House of Cards on a So...DataStax Academy

Continuous Delivery at Netflix, and beyondMike McGarr

Cassandra Virtual Node talkPatrick McFadin

Devops at Netflix (re:Invent)Jeremy Edberg

The DynaSlave PluginBrian Moyles

Etsy Activity Feeds ArchitectureDan McKinley

CultureReed Hastings

Viewers also liked (17)

Everyday I'm Scaling... Cassandra (Ben Bromhead, Instaclustr) | C* Summit 2016

Continuous Delivery at Netflix

Outside The Box With Apache Cassnadra

Cassandra: Not Just NoSQL, It's MoSQL

C* Summit 2013: Cassandra at Instagram by Rick Branson

Building Cloud Tools for Netflix with Jenkins

Netflix API - Separation of Concerns

NetflixOSS meetup lightning talks and roadmap

Cassandra at Instagram (August 2013)

Continuous delivery applied

Cassandra Summit 2014: Cassandra @ Netflix: Building a House of Cards on a So...

Continuous Delivery at Netflix, and beyond

Cassandra Virtual Node talk

Devops at Netflix (re:Invent)

The DynaSlave Plugin

Etsy Activity Feeds Architecture

Culture

Similar to How Jenkins Builds the Netflix Global Streaming Service

Lightweight Virtualization in LinuxSadegh Dorri N.

AWS reinvent 2019 recap - Riyadh - Containers and Serverless - Paul MaddoxAWS Riyadh User Group

Linux container & dockerejlp12

Windsor: Domain 0 Disaggregation for XenServer and XCPThe Linux Foundation

LOAD BALANCING OF APPLICATIONS USING XEN HYPERVISORVanika Kapoor

Docker: under the hoodJake Wise

Creating an effective developer experience on KubernetesLenses.io

Docker - Portable Deploymentjavaonfly

KVM and docker LXC Benchmarking with OpenStackBoden Russell

Weave User Group Talk - DockerCon 2017 RecapPatrick Chanezon

CI and CD at Scale: Scaling Jenkins with Docker and Apache MesosCarlos Sanchez

Automating Your CloudStack Cloud with Puppetbuildacloud

Unikernels: the rise of the library hypervisor in MirageOSDocker, Inc.

(ATS3-DEV02) Scripting with .NET Assemblies in Symyx NotebookBIOVIA

Collaborate vdb performanceKyle Hailey

The State of Linux Containersinside-BigData.com

ContainerDays Boston 2015: "CoreOS: Building the Layers of the Scalable Clust...DynamicInfraDays

Dev opsec dockerimage_patch_n_lifecyclemanagement_2019kanedafromparis

Unikernels: Rise of the Library HypervisorAnil Madhavapeddy

Kernel maintainance in Linux distributions: DebianAnne Nicolas

Similar to How Jenkins Builds the Netflix Global Streaming Service (20)

Lightweight Virtualization in Linux

AWS reinvent 2019 recap - Riyadh - Containers and Serverless - Paul Maddox

Linux container & docker

Windsor: Domain 0 Disaggregation for XenServer and XCP

LOAD BALANCING OF APPLICATIONS USING XEN HYPERVISOR

Docker: under the hood

Creating an effective developer experience on Kubernetes

Docker - Portable Deployment

KVM and docker LXC Benchmarking with OpenStack

Weave User Group Talk - DockerCon 2017 Recap

CI and CD at Scale: Scaling Jenkins with Docker and Apache Mesos

Automating Your CloudStack Cloud with Puppet

Unikernels: the rise of the library hypervisor in MirageOS

(ATS3-DEV02) Scripting with .NET Assemblies in Symyx Notebook

Collaborate vdb performance

The State of Linux Containers

ContainerDays Boston 2015: "CoreOS: Building the Layers of the Scalable Clust...

Dev opsec dockerimage_patch_n_lifecyclemanagement_2019

Unikernels: Rise of the Library Hypervisor

Kernel maintainance in Linux distributions: Debian

Recently uploaded

DBX First Quarter 2024 Investor PresentationDropbox

EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot

Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software

Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub

Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services

How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes

Understanding the FAA Part 107 License ..Christopher Logan Kennedy

Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya

presentation ICT roal in 21st century educationjfdjdjcjdnsjd

Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez

Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1

FWD Group - Insurer Innovation Award 2024The Digital Insurer

Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays

TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc

Exploring Multimodal Embeddings with MilvusZilliz

AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin

Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021

Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea

Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services

Recently uploaded (20)

DBX First Quarter 2024 Investor Presentation

EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER

Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME

Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...

Vector Search -An Introduction in Oracle Database 23ai.pptx

How to Troubleshoot Apps for the Modern Connected Worker

Understanding the FAA Part 107 License ..

Artificial Intelligence Chap.5 : Uncertainty

presentation ICT roal in 21st century education

Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood

Boost Fertility New Invention Ups Success Rates.pdf

FWD Group - Insurer Innovation Award 2024

Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery

Exploring Multimodal Embeddings with Milvus

AWS Community Day CPH - Three problems of Terraform

Six Myths about Ontologies: The Basics of Formal Ontology

Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024

Strategies for Landing an Oracle DBA Job as a Fresher

How Jenkins Builds the Netflix Global Streaming Service

1. How Jenkins Builds the Netﬂix Global Streaming Service @bmoyles @garethbowles #netﬂixcloud

2. The Engineering Tools Team Make build, test and deploy simple and pain-free Various skills - developers, build/release engineers, devops guys CI infrastructure Common Build Framework Bakery Asgard

3. Jenkins History Jenkins (and H***** before it) in use at Netﬂix since 2009 Rapid acceptance and increase in usage, thanks to: Jenkins’ ﬂexibility and ease of use Our Common Build Framework Quantity, size and variety of jobs means we need automation and extensions

4. What We Build Large number of loosely-coupled Java Web Services Common code in libraries that can be shared across apps Each service is “baked” - installed onto a base Amazon Machine Image and then created as a new AMI ... ... and then deployed into a Service Cluster (a set of Auto Scaling Groups running a particular service) So our unit of deployment is the AMI

5. Getting Built

6. Build Pipeline ASGs Artifactory yum Bakery Asgard Jenkins Jars/WARs RPMs AMIs sync resolve compile publish bake deploy Common Build Framework source Perforce GitHub

7. Jenkins Setup for a New Build Job

8. Ant Setup for a New Build Job build.xml <project name="helloworld"> <import file="../../../Tools/build/webapplication.xml"/> </project> ivy.xml <info organisation="netflix" module="helloworld"> <publications> <artifact name="helloworld" type="package" e:classifier="package" ext="tgz"/> <artifact name="helloworld" type="javadoc" e:classifier="javadoc" ext="jar"/> </publications> <dependencies> <dependency org="netflix" name="resourceregistry" rev="latest.${input.status}" conf="compile"/> <dependency org="netflix" name="platform" rev="latest.${input.status}" conf="compile" /> ...

9. Jenkins at Netﬂix

10. Not Just a Build Server Monitoring of our Cassandra clusters Automated integration tests, including bake and deploy Housekeeping and monitoring of the build / deploy infrastructure our AWS environment

11. Jenkins Statistics 1600 job deﬁnitions, 50% SCM triggered 2000 builds per day Common Build Framework changes trigger 800 builds 2TB of build data Over 100 plugins

12. Current Jenkins Architecture (One Master to Build Them All) x86_64 slave 11 x86_64 slave 1 x86_64 slave buildnode01 1 x86_64 slave Standard buildnode01 custom slaves buildnode01 buildnode01 custom slaves custom slaves slave group misc. architecture custom slaves misc. architecture misc. architecture custom slaves Amazon Linux Single Master misc. architecture m1.xlarge misc. architecture Ad-hoc slaves Red Hat Linux 2x quad core x86_64 misc. O/S & architectures 26G RAM x86_64 slave 11 x86_64Custom x86_64slave 1 slave buildnode01 ~40 custom slaves buildnode01 slave group buildnode01 maintained by product Amazon Linux teams various AWS us-west-1 VPC Netflix data center Netflix DC and office

13. Scaling Challenges - and How We Met Them

14. Jenkins Scaling Challenges Thundering herds clogging Jenkins, downstream systems Making global changes across jobs Making sure plugins can handle 1600 jobs Testing new Jenkins and plugin versions Backups

15. New Architecture (Cloudy with a Chance of Builds) x86_64 x86_64 slave slave 1 Build custom slave1 buildnode01 group Build Master custom slaves Amazon Linux Ad-hoc slaves slaves CentOS or Ubuntu misc. m1.xlarge misc. O/S & misc. m2.2xlarge architectures (4 core / 32GB) x86_64 x86_64 slave Custom slave 1 1 slave groups custom buildnode01 Amazon Linux custom Test Master slaves various Ad-hoc slaves slaves CentOS or Ubuntu misc. misc. O/S & misc. m2.2xlarge architectures (4 core / 32GB) x86_64 x86_64 slave slave 1 Test slave 1 buildnode01 group Amazon Linux m1.xlarge AWS us-west-1 VPC AWS and Netﬂix ofﬁce

16. Deploying a New Master EBS Live Test EBS master master ENI Standby 1 builds.netﬂix.com EBS Standby 2 EBS Active ASG Test ASG Jenkins Master cluster

17. Building a New Master Jenkins build watches http://mirrors.jenkins-ci.org/war/ latest/jenkins.war and downloads the new Jenkins WAR ﬁle Download latest versions of our installed plugins (from a list stored externally) Create an RPM containing the Jenkins webapp Bake an AMI with the RPM installed under Netﬂix Tomcat

18. Job Configuration Management Configuration Slicing plugin Great for the subset of config options that it handles Don’t want to extend it for each new option Job DSL plugin Set up new jobs with minimal definition, using templates and a Groovy-based DSL Change the template once to update all the jobs that use it

19. Old and busted... The new hotness... etc.

20. More Maintenance System Groovy scripts to clean up old job conﬁgs and associated artifacts: Disable jobs that haven’t built successfully for 60 days Set # of builds to keep if unset Mark builds permanent if installed on active instances Delete unreferenced JARs from Artifactory

21. Keeping Slaves Busy System Groovy scripts to monitor slave status and report on problems The Dynaslave plugin: slave self-registration

22. Further Reading http://techblog.netflix.com http://www.slideshare.net/netflix https://github.com/netflix http://jobs.netflix.com @netflixoss

23. Thank You To Our Sponsors

24. Thank you @bmoyles @garethbowles

25. Thank you Questions? @bmoyles @garethbowles

Editor's Notes

Abstract: Hi, I&#x2019;m Gareth and this is Brian, we&#x2019;re from Netflix. Welcome to part 2 of the Netflix invasion of the Jenkins User Conference !\n\nOver the last couple of years, Netflix&#x2019; streaming movie and TV service has moved almost completely to the cloud, using Amazon Web Services. You guys may well have seen some of the great presentations and blog posts by our engineers talking about how our production service works - but what does it take to get that service into production ? This talk will delve into our build and deployment architecture, concentrating on how we use Jenkins as the centrepiece of our production line for the cloud. \n
Brian and I work on the Engineering Tools team at Netflix. There are 6 of us here today (about half our team, including Justin who gave the talk on getting to your 3rd plugin this morning), so find any of us later if you want to know more about Netflix (especially if you&#x2019;d like to work with us, we&#x2019;re hiring !)\n\nOur team is responsible for making standard tools for all our engineers to make building, testing and deploying their products simple and pain-free. \n\nWe&#x2019;re now a fairly big team with a diverse range of skills, everyone can go pretty far up and down the stack (although don&#x2019;t ask me to do any UI work if you don&#x2019;t want to make your eyes bleed !)\n\nOur team runs the Netflix CI infrastructure, including a Common Build Framework that we wrote to make it trivial for developers to create builds for their products.\n\nAmong other tools that we wrote are the Bakery, which creates our custom Amazon Machine Images for deploying to the cloud, and Asgard, the open source console that you can use to manage your AWS deployments. More on these coming up.\n
We&#x2019;ve been using Jenkins (and its predecessor who Shall Not Be Named) for a few years now and it&#x2019;s become a vital part of our CI infrastructure. \n\nUse of Jenkins caught on really fast, partly because it&#x2019;s easy for people who aren&#x2019;t professional build engineers to use it, and also due to our Common Build Framework that makes it a trivial operation to set up a new build job (more on that later). In fact, Netflix&#x2019;s product teams don&#x2019;t have full time release engineers; our Engineering Tools team aims to make the build / test / deploy process trivial enough not to get in the way of product development. Jenkins has been a big help with that. \n\nWe&#x2019;ve become one of the bigger Jenkins users (although not the biggest, for those of you who saw the &#x201C;Planned Parenthood&#x201D; talk by the Intel guys earlier, they are an order of magnitude bigger than us) quite fast, and that&#x2019;s given our team quite a few challenges in meeting the demand. We&#x2019;ve come up with custom plugins, scripts and use patterns for Jenkins that we think are pretty cool and certainly make our lives easier; we&#x2019;ll share many of them in this talk. \n
Let&#x2019;s start with some background on what we build using Jenkins.\n\nTo get to the cloud, we rearchitected the Netflix streaming service from a monolithic Java webapp into many individual modules implemented as loosely-coupled web services, usually web applications or shared libraries (JARs). Most services implement a server and a client library that other services can use to talk to it.\n\nWe decided that our unit of deployment would be the Amazon Machine Image - a complete description of a server including the operating system, not just an install package. This frees us from having to maintain a configuration server such as Puppet or Chef, and makes it very easy to roll back and forward between versions of a service - you just kill instances of a version you don&#x2019;t want and spin up instances with a different version. Netflix is all about continually trying out new features and rolling forward (preferably) or back fast if needed.\n\nWe maintain a base AMI that encapsulates our selected o/s - currently CentOS, soon to be Ubuntu - plus a host of other basic services that provide monitoring, logging, web and application servers (Apache HTTPD and Tomcat) and much more. We have an application called the Bakery (soon to be open sourced) which handles installation of Netflix services onto a base AMI to create a custom AMI for each service.\n\nOnce the AMIs are ready, we deploy them into AWS Auto Scaling Groups using Asgard, our open source management console. Asgard is a fantastic piece of work that goes way beyond what the standard AWS Management Console can do, and I encourage any of you who use AWS to check it out.\n\nNote that a key aspect of using so many shared services is that each service team has to rebuild often in order to pick up changes from the other services that they depend on. This is the CONTINUOUS part of continuous integration and is where Jenkins comes in.\n
Here are a few details on how we build all those cloud services.\n
Here are the tools that form our continuous build and deployment pipeline. Jenkins jobs can provide the entire workflow.\n\nWe wrote a Common Build Framework, based on Ant with some custom Groovy scripting, that&#x2019;s used by all our development teams to build different kinds of libraries and apps. (If you think Ant is a bit long in the tooth, so do we; we&#x2019;re working on a brand new build framework using Gradle, but that will take a while to finish up and get everyone converted over).\n\nFor the continuous integration to run all those builds, we picked Jenkins because it&#x2019;s very feature rich, easy to extend, and has a very active and helpful community (thanks, guys !). \n\nWe use Perforce for our version control system as it&#x2019;s arguably the best centralized VCS available. But we&#x2019;re making increasing use of Git; for example, our many open sourced projects are all hosted on GitHub, and we use Jenkins to build them. \n\nWe publish library JARs and application WAR files to the Artifactory binary repository tool. This lets us access and add to the metadata for each published artifact, and allows us to add Apache Ivy to Ant to handle dependency resolution. So each project only has to know about its immediate dependencies.\n\nJARs and WARs are then bundled into RPM packages for installation.\n\nI mentioned the bake and deploy process for the finished artifacts just now. The Common Build Framework includes Ant tasks that access the Bakery and Asgard APIs to run these operations. Being able to do that workflow from a Jenkins job can save a lot of time; for example, our API team was able to move from one deployment every two weeks to multiple deployments per week.\n\nIt&#x2019;s worth mentioning that the full automated deployment cycle in mainly used in our test environment; deploying to production needs extra checks and balances that tend to be specific to the service being deployed, many of those aren&#x2019;t yet fully automated. \n\n\n
Here is all you need to do in Jenkins to set up a typical project&#x2019;s build job. You just tell Jenkins where to find the source code and add in the Common Build Framework, then specify where your Ant build file is and what targets to call from it.\n\nEven this minimal setup can be tedious when repeated for a large number of jobs, so we&#x2019;re working on a plugin to make it even more trivial; more on that later.\n\n
And here is a simple project&#x2019;s Ant build file and Ivy publish & dependency definition file. \n\nYou can see the Ant code simply pulls in one of the standard framework entry points like, library, webapplication, etc. \n\nThe Ivy publications section just defines what type of artifacts you want to publish; the Common Build Framework has the knowledge of how to create those artifacts.\n\n\n
Let&#x2019;s take a closer look at how we use Jenkins as the core of our build infrastructure, plus a few other interesting uses we&#x2019;ve come up with.\n
At its heart Jenkins is just a really nice job scheduler, so we&#x2019;ve found lots of other uses for it. Here are some of the main ones.\n\nOur Cassandra team has their own Jenkins instance for monitoring the health of our Cassandra clusters and performing automated repairs on them. This one is a great testimonial for Jenkins&#x2019; ease of use; Engineering Tools just gave the Cassandra guys a basic Jenkins master and some slaves, then they set up their own jobs, a custom dashboard, and even turned the blue balls for build success to green ones. \n\nHousekeeping jobs usually use system Groovy scripts for access to the Jenkins runtime. We effectively use Jenkins to manage Jenkins and its associated apps, in a similar way to the Infrastructure jobs on https://ci.jenkins-ci.org.\n\nWe&#x2019;ve posted a few of these scripts to the public Scriptler repository and will be adding more. Please check them out and feel free to make changes.\n\n\n
The CBF build floods run in 2 stages: the first main one is triggered by the SCM change, then downstream test and deploy jobs are triggered by Jenkins. \n\n\n
Our Jenkins master runs on a physical server in our data center. \n\nWe use the master strictly as a build orchestrator and a runner for housekeeping scripts; we don&#x2019;t run regular builds on the master to avoid overloading it.\n\nSlave servers are used to execute the actual builds. Our standard slaves can each run 4 simultaneous builds. Custom slave groups are set up for requirements such as C/C++ builds or jobs with high CPU or memory needs.\n\nWe make extensive use of slave labelling to control the type of slave that each job executes on, and we have housekeeping jobs that make sure that build jobsare using the correct labels. \n\nWe vary the number of standard slaves from 10 to 30 depending on demand. This is currently a manual operation (but all you have to do is run a Jenkins job !); we&#x2019;re working on autoscaling.\n\nOur cloud slaves are set up in an AWS Virtual Private Cloud (VPC), which provides common network access between our data centre and AWS. Amazon&#x2019;s us-west-1 region is physically located close to our data centre, so latency is not an issue.\n\nAd-hoc slaves in our DC or office are used by individual teams if they need an O/S variant other than those on our standard slaves, or a specific tool or licensed app. Typical uses are for our Partner Product Development team, which makes the Netflix clients that run on TVs, DVD players, smartphones, tablets, and soon - I&#x2019;m reliably informed - toasters. \n\nWe keep our standard slaves updated by maintaining a common set of tools (JDKs, Ant, Groovy, etc.) on the master and syncing the tools to the slaves when they are restarted. Custom slaves can also use this mechanism if they choose.\n\nThe single master has definitely reduce maintenance overhead and, thanks to continuing performance improvements from the Jenkins community, it&#x2019;s handled our expanding workload pretty well up to now. Still, it&#x2019;s a single point of failure, and we&#x2019;re starting to run up against a number of scaling issues with this architecture. \n\nThat&#x2019;s the present - now I&#x2019;ll hand it over to Brian to take us into the future.\n\n\n \n
Thanks Gareth. I&#x2019;m going to talk about some of the scaling challenges we&#x2019;ve faced in building and growing our Jenkins infrastructure, and what we&#x2019;ve done to address them.\n
We&#x2019;ve run into a number of scaling challenges as we&#x2019;ve evolved our build pipeline: Thundering herd problems, modifying and managing 1600 jobs, making sure those 1600 jobs work from Jenkins version to Jenkins version, plugin version to plugin version, and so on.\n\nA good example problem: when we update our build framework, this often triggers all SCM triggered jobs that use the CBF. We can scale our slave pool up, but the slaves push to artifactory, resolve deps from artifactory, and so on, so more slaves means more artifactory load. It&#x2019;s a balancing act.\n\nOur goal, of course, is to have one button build/test/deploy with as little human intervention as possible, and make the developer&#x2019;s life as pain-free as we can. All of these scaling issues get in the way of that. Here are some of the ways we&#x2019;re working on eliminating them.\n
To start, we&#x2019;re working on a sharded model for our builds.\nOur current thought is we&#x2019;ll split the current master into multiple AWS instances dedicated to different types of work: standard CI builds, integration tests, non-Java builds for our device clients, etc.\n\nThe set of plugins for each master can be streamlined to fit the type of job. Isolate long-running jobs so we don&#x2019;t bounce Jenkins for maintenance in the middle of someone&#x2019;s 10 hour integration test, schedule regular planned maintenance for the long-running cluster, etc.\n\nWe&#x2019;ll use the same bake & deploy architecture that powers our streaming service to run the Jenkins masters and slaves.\n\n\n\n \n
To deploy the Jenkins master, we&#x2019;ll use the same &#x201C;Red / Black deploy&#x201D; principle that our streaming service uses in production. We launch and test the master in a manner that prevents it from taking on traffic. When we&#x2019;re satisfied with where things are, we flip a switch, and the test becomes the real master.\n\nTo go into more detail:\nThe live ASG will have a master that&#x2019;s running builds, plus one or more standby instances with builds disabled. We store job data on EBS volumes--the Amazon term for persistent storage that doesn&#x2019;t disappear with cloud nodes. The job data is synced regularly from the live master to the standbys.\n\nThe test ASG has a similar job data setup, but uses a representative subset of jobs to give us confidence in the new master and plugin versions.\n\nIf we want to switch to a new master version, or fail over to one of the hot standbys, we bring down the live master, attach the live job data volume to the new master, and finally do a network switch to make the new master live. We&#x2019;re planning to use Elastic Network Interfaces (ENIs) to do the network switchover; these are virtual network interfaces which can be moved from one EC2 instance to another. \n\nWe haven&#x2019;t fully worked out how we will do job data replication and network switchover yet, so we&#x2019;d be interested to hear your experiences if you have a similar setup. CloudBees&#x2019; HA plugin might be an option here; we&#x2019;ll definitely be looking at that.\n\n\n\n\n\n\n
We&#x2019;ve set up a Jenkins build to make us a new master image as soon as a new Jenkins release is available.\nWe could have used the standard Jenkins RPMs, but this way lets us deploy to the same Netflix Tomcat used by production services and get things like monitoring, logging, Apache proxy etc. for free from our base AMI.\n\n\n
How do we change everyone to a new version of a tool, add a new plugin to everyone&#x2019;s job or make other common configuration changes to a large number of jobs ?\n\nThe Configuration Slicing plugin is really nice as long as you want to change one of the subset of options that it handles.\n\nJustin from our team, who presented on &#x201C;Getting to Your Third Plugin&#x201D; earlier, has written a job DSL plugin that will allow us to create job templates and simplify the process of configuring new jobs, using a Groovy DSL to define the configuration. Here&#x2019;s a quick peek at how easy that is:\n\n\n
On the left, a typical job configuration page.\nOn the right, a peek at the groovy DSL you&#x2019;d use to set up a job. This can be made even smaller using templates--pull out the job name and repo location, and you can have a very tiny, simple, reusable representation of a much more complex object, allowing you to do things in a very repeatable, consistent fashion.\n\n
We also employ a number of system groovy scripts for housekeeping both inside Jenkins itself and outside. Jobs will disable abandoned jobs that have been failing for 60 days, ensure everyone&#x2019;s retaining some build history, flagging builds as permanent if they get used in production, and so on...\n\n
To help us keep builds moving through the system, we need to keep our slave fleet in shape.\n\nWe have system groovy scripts keeping watch over the slaves that complain when something comes up that requires attention. For example, If a specialized slave goes offline for an extended period of time, the team that owns the slave will be notified so they can take action.\n\nWe also wrote the dynaslave plugin to help manage slave registration, isolation, and cleanup after disconnect.\nA quick note for those who cannot make it to the lightning talk--the dynaslave plugin, very simply, allows for slave self-registration without having to tell Jenkins anything about their actual infrastructure. We&#x2019;ll be open sourcing this plugin, so check out the lightning talk or look for our slides on the subject after the conference.\n\nWe currently manage some aspects of the dynaslave system using groovy scripts as well. For those that saw our colleague Justin Ryan&#x2019;s presentation earlier, always reach for Groovy first when you feel the temptation to play with plugin internals :)\n
If you found that interesting, you can read more at these links, especially the jobs link :) \nCheck out the Netflix OSS twitter account for announcements around our open source efforts.\n\n
A big thanks to Cloudbees and the sponsors of the conference for allowing us the opportunity to speak to everyone\n
And thanks to you all for listening. Questions?\n

How Jenkins Builds the Netflix Global Streaming Service

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (17)

Similar to How Jenkins Builds the Netflix Global Streaming Service

Similar to How Jenkins Builds the Netflix Global Streaming Service (20)

Recently uploaded

Recently uploaded (20)

How Jenkins Builds the Netflix Global Streaming Service

Editor's Notes