This presentation discusses architectural patterns in continuous integration, deployment, and optimization, along with lessons learned at Amazon.com. Its goal is to convince you that if you invest your time where you get the maximum learning from your customers, and automate everything else in the cloud (CI + CD + CO), you get fast feedback and can release early, release often, and recover quickly from your mistakes. The dynamism of the cloud lets you increase the speed of iteration and reduce the cost of mistakes, so you can continuously innovate while keeping your costs down.
Jinesh Varia – Penn State, data center, FDIC, and 5.5 years at AWS.
Customers are at the center of all equations and discussions. We continuously listen to customers for feature requests that will make their lives easier. When customers don't know what they want, we rely on statistical significance and confidence intervals to understand the behavior that is actually impacting the business. Learning from customers should, in my opinion, be given more importance than anything else, so that we can get fast feedback, change the course of action, and steer in the direction that makes sense.
Typical workflow from Idea to Release
Focus on learning; automation with the cloud helps you learn faster, iterate quickly, and reduce the cost of mistakes.
There are three main ideas.
Poka-yoke is a design practice that originated at Toyota in Japan; it is essentially mistake-proofing. Invest in designing systems so that customers cannot make mistakes when they use them. For example, suppose you were designing an ATM and deciding how the card should be read. If the machine takes the card in, there is a probability that the customer forgets it in the machine, or that the mechanics fail. The poka-yoke design is a swipe reader: the card never leaves the customer's hand, so it can never be forgotten, and the mechanism is simpler. Continuous integration gives your build and test systems a poka-yoke design: developers get alerted about bugs early, when they are cheaper to fix.
Everything should be in version control: the OS, patch levels, OS configuration, your app stack and its configuration, infrastructure configuration. Infrastructure is code; build scripts are code; test scripts are code.
Automate everything. People take more time, make errors, and are not auditable; manual work is boring, repetitive, and expensive, and people have better things to do.
Have only one way to deploy, so you know exactly who deployed, which version, with what configuration, to which hosts/machines, and rollback is easy.
Leverage the cloud for distributed builds and reduce your build and test run time so you get fast feedback
Stop instances when you don't need them: run your builds three times a day and pay only for storage the rest of the time.
Several tools available – most have support for EC2/cloud
You can configure Bamboo to start and shut down elastic instances automatically, based on build queue demands. Please refer to Configuring Elastic Bamboo for more information.
Let's put everything in the context of a web application.
You can put all of this into a CloudFormation template. I recommend that you keep it in version control and let CloudFormation manage the resources for you: the CloudFormation engine handles resource ordering, rollback, updates, and so on.
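A minimal sketch of deploying a version-controlled template, assuming the AWS SDK for Python (boto3) with credentials configured; the stack name and template path are placeholders:

```python
# Deploy a CloudFormation template kept in version control.
import boto3

def deploy_stack(stack_name: str, template_path: str) -> None:
    cf = boto3.client("cloudformation")
    with open(template_path) as f:
        template_body = f.read()

    # CloudFormation handles resource ordering, rollback on failure, and updates.
    cf.create_stack(
        StackName=stack_name,
        TemplateBody=template_body,
        OnFailure="ROLLBACK",  # roll the whole stack back if any resource fails
    )
    # Block until every resource in the stack is created.
    cf.get_waiter("stack_create_complete").wait(StackName=stack_name)

deploy_stack("webapp-staging", "templates/webapp.json")
```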
Architecture for cloud continuous integration:
1. Developer commits to git/master.
2. CI Server pulls from git/master.
3. CI Server runs all regression tests against git/master.
4. CI Server returns a report to the developer and halts if any test fails.
5. CI Server builds the artifacts.
6. CI Server pushes the project manifest to the Image Manager.
7. Package Builder (PB) gets configuration information from VSS.
8. PB builds packages (gem, .deb, rpm, ...) and stores them in the repo.
9. PB installs dependencies on an instance/host and creates one or more virtual images.
10. PB generates a CloudFormation template from the config for each environment and stores it in the repo.
11. CI Server calls the deployment server to deploy the image to a given environment (using the CloudFormation template).
12. The deployment server deploys the image to instances using the CloudFormation scripts and alerts the monitoring service.
The key advance was using our continuous build system to build not only the artifact from source code, but the complete software stack, all the way up to a deployable image in the form of an AMI (Amazon Machine Image) for Amazon EC2.
This is where the cloud really shines: where else can you get hundreds or thousands of machines, run tests on them in parallel, and shut them down when you are done? The same goes for benchmarking. Testing in the cloud means automated, virtual test labs that are live only when you need them:
- Stress test in the cloud: create AMIs with your libraries and dependencies, add compute power when needed, and turn it off to reduce costs. Find the sources of latency, potential crashes, and points of failure; get profile information through logs and instrumentation, and measure.
- Load test in the cloud: generate load from one Availability Zone against "staging" servers in another; start up a pre-configured TestBox (EC2 instance) in minutes. For example, start with 100 concurrent browsing users that randomly click on links, then increase the load by 100 users every 10 minutes until the total expected load of 100,000 users is reached.
- Performance test in the cloud: test at global scale, measuring latency from different parts of the world and how fast a page loads for a given user in a given state. Store all instrumentation data on S3 or SimpleDB.
- Web app testing: browser-based Ajax/Selenium testing from different regions (US and EU); create different deployment environments using scripts.
- Usability testing: use an on-demand workforce.
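A minimal sketch of the "live only when you need it" idea, assuming boto3; the AMI ID, instance type, and fleet size are placeholders:

```python
# Spin up a fleet of load generators, run the test, then terminate
# everything so you stop paying.
import boto3

ec2 = boto3.client("ec2")

# Launch 20 pre-configured "TestBox" instances from a load-generator AMI.
fleet = ec2.run_instances(
    ImageId="ami-xxxxxxxx",   # AMI baked with your load-test tooling
    InstanceType="c5.large",
    MinCount=20,
    MaxCount=20,
)
instance_ids = [i["InstanceId"] for i in fleet["Instances"]]
ec2.get_waiter("instance_running").wait(InstanceIds=instance_ids)

# ... drive load against the environment under test, push results to S3 ...

# Shut the lab down the moment the run is over.
ec2.terminate_instances(InstanceIds=instance_ids)
```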
Interesting things can be done now: restore production environments and fix bugs at any time.
Software inventory: get code to production as fast as you can. Code that is not in production, in the hands of the customer, is lost revenue. We really want to iterate quickly, deploy often, test, get instant feedback, and roll back if necessary, but get it out to the customer as quickly as possible.
Apollo environments and stages, for the month of May: at Amazon, someone kicks off a deployment into production every 11.6 seconds on an average weekday; we are doing continuous deployment. In that month, we kicked off a total of 1,079 deployments into production in a single hour. On average, around 10,000 hosts receive a deployment simultaneously across a whole bunch of environments, with a peak of around 30,000 hosts receiving a deployment simultaneously.
Speed of iteration matters: it is not about whether you are making the right move, but about making the move really, really fast. This is great for people like me who often make wrong moves. I can iterate quickly and find the right solution through trial and error, failing fast, without fear of failure. It is really important to keep iterating, because it is through iteration and incremental improvement that we learn what works next and build best practices.
We have to be wrong a lot in order to be right a lot. The cloud really helps you reduce the cost of failure.
And we can do that by breaking your deployments into a stream of small deployments.
The first step is to create bootstrapped instances, so that when they come up they automatically install everything you need. You can do that in two ways: use a pre-configured AMI, or install everything at boot time.
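A minimal sketch of the boot-time option, assuming boto3; the AMI ID, packages, and artifact bucket are placeholders:

```python
# Launch an instance that bootstraps itself: cloud-init runs the user-data
# script on first boot.
import boto3

USER_DATA = """#!/bin/bash
# cloud-init executes this on first boot
yum install -y httpd
aws s3 cp s3://my-artifact-repo/webapp-latest.tar.gz /tmp/
tar -xzf /tmp/webapp-latest.tar.gz -C /var/www/html
service httpd start
"""

ec2 = boto3.client("ec2")
ec2.run_instances(
    ImageId="ami-xxxxxxxx",  # a plain base AMI; everything else installs at boot
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
    UserData=USER_DATA,      # boto3 base64-encodes this for you
)
```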
Simple architecture that does continuous deployment using cloud-init.
cloud-init and CloudFormation
Simple architecture that does continuous deployment using Chef and/or Puppet:
1. Developer commits to git/master.
2. CI Server pulls from git/master (every check-in).
3. CI Server runs all regression tests against git/master.
4. CI Server returns a report to the developer and halts if any test fails.
5. CI Server builds the artifacts and packages and stores them in a repository (for faster installs).
6. CI Server pushes the project manifest to the Image Manager.
7. The Config Server (CS) is either a Chef Server or a Puppet Master: it manages the Chef nodes (or Puppet clients), generates reports, and provides a UI for the managed instances.
8. CI Server sends the new code directly to the Config Server.
9. The Deploy Server (DS) provisions cloud resources and deploys to instances using CloudFormation scripts or the EC2 APIs.
10. DS leverages a base AMI with the client (chef-client or puppetrun) pre-installed.
Automate deployment to multiple Availability Zones, and increase the number of AZs as you grow.
So how can you deploy quickly, without any downtime?
One simple technique is blue-green deployment. Green is your current environment; you deploy the new version to a parallel blue environment and, once it checks out, switch traffic over to it, keeping green around for rollback.
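A minimal sketch of the cutover, assuming boto3 and DNS hosted on Route 53; the hosted zone ID, domain, and ELB DNS names are placeholders:

```python
# Cut traffic over from the current (green) environment to the new (blue)
# one by updating a Route 53 record.
import boto3

route53 = boto3.client("route53")

def point_traffic_at(elb_dns_name: str) -> None:
    route53.change_resource_record_sets(
        HostedZoneId="Z123EXAMPLE",
        ChangeBatch={
            "Changes": [{
                "Action": "UPSERT",  # replaces the record if it exists
                "ResourceRecordSet": {
                    "Name": "www.example.com",
                    "Type": "CNAME",
                    "TTL": 60,  # short TTL so the cutover takes effect quickly
                    "ResourceRecords": [{"Value": elb_dns_name}],
                },
            }]
        },
    )

# Flip to the new environment; flipping back is the rollback.
point_traffic_at("blue-elb-1234.us-east-1.elb.amazonaws.com")
```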
BG + Autoscaling = Cost savings + Scaling up
BG + dark launches, in which you deploy both versions side by side but customers do not know about it. Two separate code paths, but only one is activated; the other is activated by a feature flag. This can be your beta test, where you explicitly turn the flag on for a subset of users.
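A minimal sketch of such a flag, with a hypothetical in-process flag store (real systems usually keep flags in a config service, so a flip needs no redeploy) and consistent per-user bucketing:

```python
# Dark launch behind a feature flag: both code paths are deployed, the flag
# decides which one a given user gets.
import hashlib

FLAGS = {"new_checkout": {"enabled": True, "percent": 1}}  # dark: 1% of users

def current_checkout_path(cart):
    return "v1"  # existing path (stub for the sketch)

def new_checkout_path(cart):
    return "v2"  # new path, deployed but dark (stub for the sketch)

def flag_on(flag_name: str, user_id: str) -> bool:
    flag = FLAGS.get(flag_name)
    if not flag or not flag["enabled"]:
        return False
    # Hash the user id so each user lands in the same bucket on every request.
    bucket = int(hashlib.sha1(user_id.encode()).hexdigest(), 16) % 100
    return bucket < flag["percent"]

def checkout(user_id: str, cart):
    if flag_on("new_checkout", user_id):
        return new_checkout_path(cart)
    return current_checkout_path(cart)
```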
The goal is to get instant feedback from your users and see how your application behaves under real load. A/B testing is great: control and treatment (see http://20bits.com/articles/an-introduction-to-ab-testing/). Prove the success of interface changes; prove interest in new features.
What if you have thousands of nodes? Blue-green deployment works well, but you may not want to spin up an additional 100% of capacity. Instead, you can grow gradually in small increments of 10%, so you pay for only 10% more nodes, and you learn more about your customers' behavior by routing only a small percentage of users to the newly deployed code base. Start small, with 1%: A/B testing randomizes the traffic, and you collect feedback on what is and is not working, all with a simple configuration change. You don't want to send all your traffic to the new nodes just yet; Auto Scaling, on the other hand, handles the growth. Note: you have already tested the whole application and brought up the additional capacity before you route real traffic to it.
Slowly ramp up traffic by simply changing the configuration file in your A/B testing setup, and watch Auto Scaling kick off additional instances as needed. Note that you are not disrupting user behavior, because each user sees only the version they hit.
As you slowly ramp up traffic, you know whether the feature is working for your customers, and you keep learning more about them.
The pager is beeping: something went wrong, time to roll back. Rollbacks are easy…
Change the config file and route all the traffic back to your previous hosts. Test what went wrong, fix it, and deploy new instances using the same process.
Start ramping up again…..
You are gradually deploying new software, using the dynamism of the cloud with zero downtime. And that’s cool.
Autoscaling handles all the capacity planning for you.
The dial-up is based on users, so each user gets a consistent experience. The beauty of this approach is that you can run multiple measured experiments with different features for your customers, get instant feedback, and fully deploy only if it makes sense. Based on statistical significance and confidence intervals, you determine which feature is really causing the improvement and creating an impact on the business.
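A minimal sketch of the significance check, using a two-proportion z-test; the conversion counts are made-up example numbers:

```python
# Decide whether the treatment's conversion rate beats the control's with
# statistical significance.
import math

def z_score(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled conversion rate
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Control: 500 conversions out of 10,000 users; treatment: 580 out of 10,000.
z = z_score(500, 10_000, 580, 10_000)
print(f"z = {z:.2f}")  # |z| > 1.96 => significant at the 95% confidence level
```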
What about databases? Those can be tricky; they are not as simple to cut over from green to blue. For all practical purposes, I would recommend using a non-relational database. But you can also do blue-green-style database deployments with zero downtime, in several approaches. #3 is the best one, because you are leveraging the decoupling of DB and application and breaking one big job into multiple tasks.
A simple example: a database schema change converting a character column to an integer.
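A minimal sketch of that change done expand/contract style, so the old and new application versions can run side by side; sqlite3 is used only to keep the sketch runnable, and the table and column names are placeholders:

```python
import sqlite3

db = sqlite3.connect("app.db")
db.execute("CREATE TABLE IF NOT EXISTS orders"
           " (id INTEGER PRIMARY KEY, quantity TEXT)")

# Step 1 (old app still live): ADD the new column; drop nothing yet.
db.execute("ALTER TABLE orders ADD COLUMN quantity_int INTEGER")

# Step 2: backfill (one statement here; batch it on a large table so you
# never hold a long lock).
db.execute("UPDATE orders SET quantity_int = CAST(quantity AS INTEGER)"
           " WHERE quantity_int IS NULL")
db.commit()

# Step 3: deploy the new application version that reads/writes quantity_int.
# Rollback at this point is just redeploying the old app: the old column is
# still there and still correct.

# Step 4 (a later deployment, once the new version is proven): drop the old
# column (SQLite supports DROP COLUMN from 3.35; other engines have similar DDL).
db.execute("ALTER TABLE orders DROP COLUMN quantity")
db.commit()
```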
Autoscaling in recurring mode
Rollbacks will be easy, because you can deploy the application and DB combination that worked in the previous iteration.
Again, not everything can be done when it comes to RDBMS migration. But you can combine DB migration and application deployment in innovative ways to roll out a new version in a series of steps, with zero downtime and easy rollback.
The goal is to do it in small batches
By doing this, continuously deploying new versions, you reduce the cost of making mistakes.
And you just saw how cloud really helps here
You keep optimizing, and you further reduce your cost and impress your boss: an individual can make a significant improvement to the bottom line. With one small database optimization you reduce the database footprint and save the company a bunch of money right away. This does not happen in the non-cloud world, where even if you implement an optimization, you have already paid for the infrastructure. Again, poka-yoke: the cloud really helps you design right, so you keep working towards efficiency.
Only happens in the cloud
Build websites that sleep at night. Build machines that live only when you need them.
Shrink your server fleet from 6 to 2 at night, and bring it back in the morning.
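A minimal sketch using scheduled Auto Scaling actions, assuming boto3; the group name and the cron schedules (UTC) are placeholders:

```python
# Scheduled Auto Scaling: shrink the fleet from 6 to 2 every night and
# bring it back every morning.
import boto3

autoscaling = boto3.client("autoscaling")

# 22:00 UTC every day: scale the web fleet down to 2 instances.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="webapp-asg",
    ScheduledActionName="sleep-at-night",
    Recurrence="0 22 * * *",
    MinSize=2, MaxSize=2, DesiredCapacity=2,
)

# 06:00 UTC every day: bring the fleet back to 6.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="webapp-asg",
    ScheduledActionName="wake-up",
    Recurrence="0 6 * * *",
    MinSize=2, MaxSize=6, DesiredCapacity=6,
)
```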
Now, this is just the beginning; we really haven't spent enough time here. But one developer wrote a script that dumps all the idle resources as custom metrics to CloudWatch, and if they have been idle for two weeks, it automatically alerts the users so they can save money. Now that is cool! This kind of intelligence can be baked into your platform. It can only happen when you have a great API and you pay only for the infrastructure you use.
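A minimal sketch of that idea, assuming boto3; the metric namespace, SNS topic ARN, and the idle-detection logic itself are placeholders:

```python
# Audit script: publish a custom utilization metric to CloudWatch and alert
# through SNS when resources have sat idle for two weeks.
import boto3

cloudwatch = boto3.client("cloudwatch")
sns = boto3.client("sns")

def report_idle_instances(idle_instance_ids: list[str]) -> None:
    # Publish how many instances the audit found idle as a custom metric.
    cloudwatch.put_metric_data(
        Namespace="Custom/Utilization",
        MetricData=[{
            "MetricName": "IdleInstances",
            "Value": len(idle_instance_ids),
            "Unit": "Count",
        }],
    )
    if idle_instance_ids:  # e.g. near-zero load for the last two weeks
        sns.publish(
            TopicArn="arn:aws:sns:us-east-1:123456789012:cost-alerts",
            Subject="Idle instances detected",
            Message=f"Idle for 2 weeks: {idle_instance_ids}",
        )

report_idle_instances(["i-0abc1234", "i-0def5678"])
```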
http://calculator.s3.amazonaws.com/calc5.html?key=calc-D192C939-B31D-4F29-B84E-F07F8E6190F9 http://calculator.s3.amazonaws.com/calc5.html?key=calc-B4C1AD60-742A-401E-B99E-C21862C376AC http://calculator.s3.amazonaws.com/calc5.html?key=calc-A66897DE-2773-494D-B046-7C7191BC6336
Without elasticity you will not be able to accelerate.
This slide applies to Amazon EC2, but just as easily describes Amazon S3’s value proposition.
Let's put everything in the context of a web application.
See the animation: Direct Connect.