Building Cloud Tools for Netflix (CodeMash 2012)

Be sure to follow along with the notes below for each slide.

  • Abstract: Netflix has moved its production services from being completely datacenter-based to being primarily cloud-based in a little over a year. This talk will briefly explain why we did it and why we chose Amazon's AWS. It will then delve into the evolution of our build and deployment architecture, starting with how our original monolithic, DC-based, SQL-tied webapp was built and deployed, and then detailing the evolution of the continuous integration systems that helped prepare us for the cloud move. Finally, it will cover our current full build pipeline and deployment system, which keeps us agile in the cloud by providing quick turnaround and rapid roll-out and recovery.
  • I manage the Engineering Tools team at Netflix. We are all about creating the tools and systems our software engineers use to build, test and deploy their apps to the cloud (and the DC too).
  • Q: Why did we choose to move to the cloud?
    A: As our rate of growth accelerated, we were unable to predict how fast we would need new capacity, and our lead times were too long to keep up if things really took off. We also liked the idea of providing a cloud-like infrastructure to let our individual engineering teams be as agile and productive as possible. Since we knew we didn't want to become a cloud provider ourselves, we decided to leverage a third-party cloud solution.
    Q: Why did we choose Amazon AWS?
    A: We didn't want to be the big fish in a little pond; we wanted to be a medium fish in a big pond. AWS is the largest public cloud around and is the only place we could be smaller than a big fish. We also didn't want to re-invent a lot of infrastructure, so we make use of a number of AWS services like Auto Scaling, SimpleDB, Elastic Load Balancing, etc. AWS is at least two years ahead of its closest competition in this area.
  • A little background about where we were coming from. We were originally one big monolithic Java webapp on a couple hundred machines in a local DC, with a single big Oracle database behind it. All of these machines were individual dedicated hardware boxes (IBM Power6 with RedHat + JDK 5). Our IT department managed the DC, hardware, networking, security, machine setup and provisioning.
    In the last year before the cloud move, we began refactoring the big webapp into middle-tier services, and IT introduced VMware to manage new machines, but machine configuration was still very manual.
  • A rough picture of our old world: we had a bi-weekly push cycle with a fixed calendar for the year, alternating database-push weeks and code-push weeks.
    Our "WebCM" tool was a big mess of Perl that did everything for everyone: builds, database pushes, NetScaler control, code pushes, etc. Very brittle.
    New app additions required code changes to this tool, and often ran into the brittle areas.
  • What did we need to do to get to the cloud?
    Each individual product team was responsible for refactoring its functionality into a webservice, often including shared libraries (jars).
    My team was responsible for creating a set of easy-to-use tools to simplify and automate the builds of the applications and shared libraries.
    We were also responsible for building the base machine image and creating the architecture for automating the assembly (aka baking) of the individual application images.
    And we were responsible for building the web-based tool used to deploy and manage the application clusters.
  • Let's step back and take a look at the evolution of our build system, from the old ad-hoc Ant scripts to where we are today.
  • Step one: Collect underpants. Pretty much.
    The first step we made, even pre-cloud, was to carve out common code into libraries that could be shared across apps. This enabled the middle-tier teams to migrate out of the big webapp and still share code with it.
    At the same time, we collected all of the common Ant scripts into a single central package that could be used to build different kinds of libraries and apps.
    Alas, most of these libraries were built on developer workstations, and our mechanism for library sharing was to check the jars back into SCM (Perforce).
  • Step two: ?
    The next step was to get this all onto a continuous integration server. We picked Hudson because it was very feature rich, extensible, and had an active community. Now we at least had visibility into what was going on, and less worry about dirty builds published from developer machines.
    We also began to add a lot of build metadata to the manifests so that we could retroactively figure out where things came from, even before our whole pipeline had traceability.
    And we were able to publish application artifacts on the Jenkins job/build pages via archived artifacts. This helped us transition the DC deployment tool to use Jenkins for all of its builds.
  • Step three: Profit!
    This is where we stopped checking library jars into SCM and started publishing them to Artifactory. This also gave us access to the build metadata and allowed us to add Ivy to Ant to abstract the build and runtime jars into a dynamic dependency graph, so each project only had to know about its immediate dependencies.
    We were then also able to start treating application artifacts like library artifacts and publish them to Artifactory as well.
  • Here is most of a typical project's Ant and Ivy files. You can see the Ant code simply pulls in one of the standard framework entry points, like library, webapplication, etc.
    Then the Ivy file specifies what needs to get built and what the dependencies are. We have some extra Groovy code added to our Ant scripts that can drive Ant targets based on the Ivy artifact definitions. This helps make the build definition declarative and yet flexible.
    Yes, XML makes your eyes bleed, and there is a lot of redundancy here. But at least it's small and manageable.
  • Since we were already using Ant and Ivy, and had already incorporated Groovy, it was a fairly small step to migrate our common build framework to Gradle.
    Now we have something lean and mean and clean. Nebula is our "Netflix Build Language" and is really just a Gradle plugin that provides a bunch of default configuration, customizes the standard Gradle Java and other tasks, and defines some additional tasks.
  • OK, back to the build pipeline again. We have a vague "app bundles" output in this diagram. Let's delve into how we manage the application bundle artifacts in more detail.
  • Here is a simplified picture of what the flow looks like at the tail end of the build, for getting the application artifacts ready for baking.
    The first-generation approach is to treat them like system packages: package them up into RPMs and post those to our Yum repo. (We actually scp them into an incoming dir, which is scanned by a daemon that adds the new RPMs to the repo and rebuilds its index.) This is problematic because rpmbuild is RedHat/CentOS/Fedora specific and very brittle even across those distros and versions. It also means we have to use build slaves whose OS version matches the final production OS version, or be really careful writing spec files.
    Our newer approach is to treat app bundles like the other build artifacts: wrap all the app files up into a simple archive, like .zip or .tgz, include a BOM file, and just publish it to an app repo in Artifactory. We then let the bakery unpack the bundles when it assembles the image.
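As a rough illustration of the newer approach (not Netflix's actual tooling), here is a minimal Python sketch that wraps an app's files plus a simple BOM into a .tgz and publishes it to an Artifactory repo with a plain HTTP PUT; the repo name, paths and credentials are made-up assumptions.

```python
import os
import tarfile
import requests  # third-party HTTP client

def bundle_app(app_name, version, app_dir, out_dir="/tmp"):
    """Create <app>-<version>.tgz containing the app files plus a simple BOM."""
    bom_path = os.path.join(app_dir, "BOM")
    with open(bom_path, "w") as bom:
        bom.write(f"name={app_name}\nversion={version}\n")
        for root, _, files in os.walk(app_dir):
            for f in files:
                bom.write(os.path.relpath(os.path.join(root, f), app_dir) + "\n")

    archive = os.path.join(out_dir, f"{app_name}-{version}.tgz")
    with tarfile.open(archive, "w:gz") as tgz:
        tgz.add(app_dir, arcname=app_name)  # BOM rides along inside the archive
    return archive

def publish_to_artifactory(archive, base_url, repo, user, password):
    """Deploy the archive; Artifactory accepts a simple PUT to a repo path."""
    target = f"{base_url}/{repo}/{os.path.basename(archive)}"
    with open(archive, "rb") as f:
        resp = requests.put(target, data=f, auth=(user, password))
    resp.raise_for_status()
    return target

# Example (hypothetical values):
# archive = bundle_app("helloworld", "1.2.3", "/builds/helloworld/dist")
# publish_to_artifactory(archive, "https://artifactory.example.com/artifactory",
#                        "app-bundles", "ci-user", "secret")
```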
  • No, not the Qwikster kind of baked.
    Baking is where we pre-assemble complete, ready-to-launch machine images for each app.
  • So, why do we bake custom images instead of just using Puppet or Chef to deploy packages dynamically onto freshly launched generic machines?
    We like to front-load the full machine assembly to build time instead of waiting until deployment time, because:
    • More reliable: fewer systems that can fail at deploy time, right when we need them most.
    • Faster launch: quicker reaction to load increases, e.g. autoscaling up can be more precise.
    • Single image: produces exactly homogeneous clusters, with no file/package version skew across machines in a cluster.
  • The first step of the baking process is to create the "base" image we use for baking all app images. This is done once every week or two.
    We start with a standard Linux distro as a foundation (CentOS now, Ubuntu on the way), and add in our favorite, custom, and customized packages: Apache, Java (JDK 6 and 7), Tomcat, Perl, Python, provisioning and startup scripts, log management tools, monitoring agents, etc.
    The end result is a beefed-up OS image that is ready to go and just needs an app added.
  • App images are baked in a high-speed bakery cluster; each AWS region (Virginia, California, Oregon, Ireland) has a cluster with a handful of bakery instances.
    Base images are pre-mounted as EBS volumes, with a pool of multiple mounts per bakery instance. For each app build the bakery does these things:
    • One base volume is grabbed from the pool.
    • The app bundle is pushed or pulled to the bakery and then provisioned onto the image volume. (Currently: push RPMs to the Yum repo and pull them from the bakery with yum install. The new way: publish tarballs to Artifactory and pull them from the bakery with wget, then unpack.)
    • The EBS volume is then snapshotted, and the EBS snapshot is registered as an AMI.
    • The volume is detached and a new clean base is reattached in its place.
    Later, unused and obsolete AMIs get cleaned up by the Janitor Monkey.
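To make the snapshot-and-register step concrete, here is a hedged sketch of that part of a bake using boto3 (a modern AWS SDK for Python, not the tooling Netflix used in 2012); the volume IDs, device names and AMI naming are illustrative assumptions.

```python
import boto3  # modern AWS SDK for Python; illustrative only

ec2 = boto3.client("ec2", region_name="us-west-1")

def snapshot_and_register(volume_id, app_name, version):
    """Snapshot a provisioned base+app EBS volume and register the snapshot as an AMI."""
    snap = ec2.create_snapshot(
        VolumeId=volume_id,
        Description=f"{app_name}-{version} app image",
    )
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

    image = ec2.register_image(
        Name=f"{app_name}-{version}",
        RootDeviceName="/dev/sda1",                      # assumed root device
        BlockDeviceMappings=[{
            "DeviceName": "/dev/sda1",
            "Ebs": {"SnapshotId": snap["SnapshotId"]},
        }],
        VirtualizationType="paravirtual",                # assumption for 2012-era instances
    )
    return image["ImageId"]

# After registering, the bakery would detach the used volume and attach a fresh
# clean base volume in its place, e.g.:
# ec2.detach_volume(VolumeId=volume_id)
# ec2.attach_volume(VolumeId=new_base_volume_id, InstanceId=bakery_instance_id, Device="/dev/sdf")
```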
  • Now that we have all these pre-baked application machine images, what do we do with them?
    We use the standard AWS architecture, plus just a couple of light abstractions on top to help us manage things.
  • Here is an overview of the AWS architectural model and the way we use it to create applications and clusters. Each of the items here exists as an entity in the AWS model and can be created, named, listed and modified using the Amazon AWS APIs.
    The heart of our cluster is the Auto Scaling Group (ASG), which is where we define the image to launch, how many instances we want, and what security and network routing they should have. The Elastic Load Balancer (ELB) is used for front-end traffic only, and thus is not needed for middle-tier services.
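As a rough sketch of wiring those entities together, the following uses boto3 (again a modern SDK, not the 2012 tooling) to create a launch configuration from a baked AMI and an ASG that launches instances from it behind an ELB; all names, sizes and IDs are made-up examples.

```python
import boto3  # illustrative modern SDK

autoscaling = boto3.client("autoscaling", region_name="us-west-1")

# Launch Configuration: which AMI to launch, with what instance type and security group.
autoscaling.create_launch_configuration(
    LaunchConfigurationName="helloworld-v008-lc",
    ImageId="ami-0123456789abcdef0",          # the baked app AMI (example ID)
    InstanceType="m1.large",
    SecurityGroups=["helloworld"],
)

# Auto Scaling Group: how many instances, where, and (for front-end apps) which ELB.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="helloworld-v008",
    LaunchConfigurationName="helloworld-v008-lc",
    MinSize=3,
    MaxSize=12,
    DesiredCapacity=3,
    AvailabilityZones=["us-west-1a", "us-west-1b"],
    LoadBalancerNames=["helloworld-frontend"],  # omit for middle-tier services
)
```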
  • Every application that we deploy always has this same model, so we maintain a many-to-one association between these entities and each application.
  • But there is no built-in support for this aggregating entity in AWS, so we just define our own domain in SimpleDB to keep track of them.
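For illustration, an application registry like that could look roughly like this with boto3's SimpleDB client; the domain and attribute names here are invented, not the actual Netflix schema.

```python
import boto3  # SimpleDB ("sdb") client; illustrative only

sdb = boto3.client("sdb", region_name="us-east-1")

# One-time setup: a domain to hold one item per application.
sdb.create_domain(DomainName="APPLICATIONS")

def register_application(name, owner, email):
    """Record an application and its metadata; AWS objects are tied to it by naming convention."""
    sdb.put_attributes(
        DomainName="APPLICATIONS",
        ItemName=name,
        Attributes=[
            {"Name": "owner", "Value": owner, "Replace": True},
            {"Name": "email", "Value": email, "Replace": True},
        ],
    )

# Example (hypothetical):
# register_application("helloworld", "Engineering Tools", "engtools@example.com")
```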
  • But we need more than one ASG for each app, and this is why.
  • First scenario: version 007 is running smoothly in production. We push version 008 out, monitor it, and everything looks good, so we switch all traffic over to it. After letting it run through one peak load (usually overnight) we know that it is good, and we can terminate the version 007 ASG.
  • Different scenario: version 007 is running smoothly in production. We push version 008 out, monitor it, and things look fair, so we switch all traffic over to it. Then things start to unravel, and now we can see that 008 wasn't so great after all. So we just switch traffic back to 007 and terminate 008 (or maybe leave some 008 machines around for forensic purposes). A sketch of the traffic switch is shown below.
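In both scenarios the switch amounts to changing which ASG serves the cluster's ELB. Here is a hedged sketch of one way to express that with today's boto3 APIs (not the mechanism available in 2012); the ASG and ELB names follow the api-usprod-v007/v008 example from the slides, everything else is assumed.

```python
import boto3  # illustrative modern SDK

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

def switch_traffic(elb_name, from_asg, to_asg):
    """Point the ELB at the new ASG's instances, then pull the old ASG out of rotation."""
    autoscaling.attach_load_balancers(
        AutoScalingGroupName=to_asg,
        LoadBalancerNames=[elb_name],
    )
    autoscaling.detach_load_balancers(
        AutoScalingGroupName=from_asg,
        LoadBalancerNames=[elb_name],
    )

# Push: send traffic to the new version, keep the old one running but idle.
# switch_traffic("api-frontend", "api-usprod-v007", "api-usprod-v008")
#
# Fast rollback: the revert is the same call in the other direction.
# switch_traffic("api-frontend", "api-usprod-v008", "api-usprod-v007")
#
# Only after the new version survives a peak would the old ASG be deleted:
# autoscaling.delete_auto_scaling_group(AutoScalingGroupName="api-usprod-v007", ForceDelete=True)
```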
  • The cluster is another case where we have an abstract entity that doesn't exist in AWS. In this case, we just use naming patterns to associate related ASGs into logical clusters.
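A minimal sketch of grouping by that naming convention, using the api-usprod-v007 style names from the slides (the exact parsing rules here are assumptions, not Netflix's actual implementation):

```python
import re
from collections import defaultdict

# ASG names carry a reserved "-vNNN" push-number suffix; stripping it yields the cluster name.
VERSION_SUFFIX = re.compile(r"^(?P<cluster>.+?)-v(?P<version>\d{3})$")

def cluster_of(asg_name):
    """Return (cluster, version) for a versioned ASG name, or (name, None) if unversioned."""
    m = VERSION_SUFFIX.match(asg_name)
    if m:
        return m.group("cluster"), int(m.group("version"))
    return asg_name, None

def group_into_clusters(asg_names):
    """Aggregate related ASGs into logical clusters by parsed name."""
    clusters = defaultdict(list)
    for name in asg_names:
        cluster, _ = cluster_of(name)
        clusters[cluster].append(name)
    return dict(clusters)

# Example:
# group_into_clusters(["api-usprod-v007", "api-usprod-v008", "search-usprod"])
# -> {"api-usprod": ["api-usprod-v007", "api-usprod-v008"], "search-usprod": ["search-usprod"]}
```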

Transcript

  • 1. Building for the Cloud @ Netflix @cquinn #netflixcloud (@joesondow)
  • 2. Who Am I? Manager, Engineering Tools team @ Netflix. Automated build tools and systems. Base machine image definition and maintenance. Application machine image baking. Netflix Application Console: deploy and manage.
  • 3. Why Cloud? Why AWS? We can't build datacenters fast enough. We want to use clouds, not build them. Leverage AWS scale: "the biggest public cloud". Leverage AWS features: "two years ahead of the rest".
  • 4. Legacy Website: mostly one monolithic Java web app, a single big Oracle DB, dedicated data center machines.
  • 5. Old School Build & Deploy (diagram): WebCM, a custom Perl web app used to build, deploy and manage, pushes app bundles (war + config) and database updates to manually configured website machines (Linux, Apache, Tomcat) behind the datacenter LB, and controls LB traffic; Oracle sits behind the website tier.
  • 6. Path to AWS: building newly refactored Java web services; baking (assembling) EC2 Amazon Machine Images; deploying and managing service clusters.
  • 7. Getting Built
  • 8. Step One: Componentize (diagram): developer workstations run Ant targets (sync, test, build, release); source and library jars are checked into Perforce; the library jars are used in app builds.
  • 9. Step Two: Automate (diagram): Hudson runs the Ant targets (sync, test, build, release) against source and jars from Perforce, producing app bundles; library jars are checked back in and used in app builds.
  • 10-11. Step Three: Full Pipeline (diagram): Jenkins (Ant targets plus some Groovy, Gradle coming) syncs source from Perforce, resolves libraries via Ivy, compiles, tests, builds and reports, and publishes snapshot/release libraries and app bundles to Artifactory.
  • 12. build.xml:
    <project name="helloworld">
        <import file="../../../Tools/build/webapplication.xml"/>
    </project>
    ivy.xml:
    <info organisation="netflix" module="helloworld">
    <publications>
        <artifact name="helloworld" type="package" e:classifier="package" ext="tgz"/>
        <artifact name="helloworld" type="javadoc" e:classifier="javadoc" ext="jar"/>
    </publications>
    <dependencies>
        <dependency org="netflix" name="resourceregistry" rev="latest.${input.status}" conf="compile"/>
        <dependency org="netflix" name="platform" rev="latest.${input.status}" conf="compile"/>
        ...
  • 13. build.gradle:
    apply plugin: 'nebula'
    group = 'netflix'
    artifacts {
        archives package
        archives javadocJar
    }
    dependencies {
        compile(
            [group: 'netflix', name: 'resourceregistry', version: 'latest.release'],
            [group: 'netflix', name: 'platform', version: 'latest.release'],
            ...
  • 14-15. Step Three: Full Pipeline (the pipeline diagram, repeated).
  • 16-19. App Artifacts (diagram build-up): a Jenkins master and slaves in the AWS us-west-1 VPC produce app bundles; app bundle RPMs go to the Yum repo, and app bundle artifacts go to Artifactory.
  • 20. Getting Baked
  • 21-27. Why bake? (build-up) Traditional: launch the OS from a generic AMI onto an instance, install packages, install the app. Netflix: launch OS + app together from a pre-baked app AMI.
  • 28-31. Base Image Baking (diagram build-up): a foundation AMI (Linux: CentOS, Fedora, Ubuntu) in S3/EBS is mounted by Bakery EC2 slave instances, Yum/Apt installs RPMs (Apache, Java, ...), and the result is snapshotted as the base AMI, ready for app bakes.
  • 32-35. App Image Baking (diagram build-up): the base AMI (Linux, Apache, Java, Tomcat) in S3/EBS is mounted by Bakery EC2 slave instances, the app bundle is installed from Jenkins / Yum / Artifactory, and the result is snapshotted as the app AMI, ready to launch.
  • 36. Getting Deployed
  • 37. Getting Deployed: Applications; Clusters.
  • 38-47. Cloud deployment model (diagram build-up): an Elastic Load Balancer in front of an Auto Scaling Group of Instances, whose Launch Configuration ties together the Amazon Machine Image and the Security Group.
  • 48-55. Cloud deployment model (diagram build-up): Search, API, Ratings, Streaming Starts, Autocomplete, Sign Up... each one an Application.
  • 56-58. Inventing the Application. Problem: Application is not an Amazon concept. Solution: create an Application domain in SimpleDB and enforce naming conventions on Amazon objects.
  • 59-65. Fast Rollback (build-up): Optimism causes outages. Production traffic is unique. Keep the old version running. Switch traffic to the new version. Monitor results. Revert traffic quickly.
  • 66-72. Fast Rollback (diagram build-up, happy path): api-frontend sends traffic to api-usprod-v007; api-usprod-v008 is launched, traffic is switched to v008, and v007 is eventually terminated.
  • 73-78. Fast Rollback (diagram build-up, rollback path): api-usprod-v008 is launched and takes traffic, problems appear, traffic is switched back to api-usprod-v007, and v008 is terminated.
  • 79-81. Inventing the Cluster. Problem: two ASGs with one function but different names. Solution: append the version number in a reserved format, and parse the ASG name to determine the long-term "cluster".
  • 82-83. Thank you. Questions? @cquinn (@joesondow). http://www.slideshare.net/carleq http://techblog.netflix.com