
(APP307) Leverage the Cloud with a Blue/Green Deployment Architecture | AWS re:Invent 2014

Minimizing customer impact is key to successfully rolling out frequent code updates. Learn how to leverage the AWS cloud so you can minimize bug impacts, test your services in isolation with canary data, and easily roll back changes. Learn to love deployments, not fear them, with a blue/green architecture model. This talk walks you through the reasons it works for us and how we set up our AWS infrastructure, including package repositories, Elastic Load Balancing load balancers, Auto Scaling groups, internal tools, and more to help orchestrate the process. Learn to view thousands of servers as resources at your command to help improve your engineering environment, take bigger risks, and not spend weekends firefighting bad deployments.


  1. 1. © 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc. November 13, 2014 | Las Vegas APP307 - Leveraging the Cloud with a Blue-Green Deployment Architecture Jim Plush, Sr. Director of Engineering, CrowdStrike - @jimplush Sean Berry, Principal Software Engineer, CrowdStrike - @schleprachaun
  2. 2. About us
  3. 3. Cybersecurity startup •Founded in September 2011 •~150 employees •Detection/prevention – advanced cyber threats – real-time detection – real-time analytics
  4. 4. Published experts
  5. 5. Event Stream Processing Sensor Targeted Malicious Malware The "CLOUD" {"date":"11/14/2014 08:03", "path": "C:\WINDOWS\Programs\Word.exe", "id": 49, "parentId": 48} {"date":"11/14/2014 08:03", "path": "C:\WINDOWS\System32\cmd.exe", "id": 50, "parentId": 49} {"date":"11/14/2014 08:03", "path": "C:\WINDOWS\Programs\Word.exe", "id": 51, "parentId": 50} DNS Lookup {"date":"11/14/2014 08:03", "dns": "badapple.cc", "id": 52, "parentId": 51} TCP Connect {"date":"11/14/2014 08:03", "tcp_connect": "10.10.10.10", "id": 53, "parentId": 51} FTP Download {"date":"11/14/2014 08:03", "download": "10.10.10.10/badstuff.exe", "id": 54, "parentId": 51} Document Exfiltration {"date":"11/14/2014 08:03", "scp": "C:\Documents\TradeSecrets.doc", "id": 55, "parentId": 54}
  6. 6. Tactical UI
  7. 7. Data ingestion [architecture diagram: sensors → Elastic Load Balancing load balancer → termination servers → content routers → Kafka → processors (Processor 1, Processor 2) → data plane (DynamoDB, Redis, Amazon RDS, Amazon Redshift, Amazon Glacier, Amazon S3), with UI, API, and external services alongside]
  8. 8. High scale, big data •Fortune 500, think tanks, non-profits •100K+ events per second – expected to hit 500K EPS by end of 2015 •Each enterprise customer can generate 2-4 TB of data per day •Microservice architecture •Polyglot environment
  9. 9. Our tech stack is complicated
  10. 10. …but possible because of AWS
  11. 11. Motivation
  12. 12. Solving for the problems •OMG, all servers need to be patched?? •I’m afraid to restart that service; it’s been running for 2 years •Large rolling restarts •Deployment fear –Friday night deploys •B/G for event processing?
  13. 13. Our primary objectives for deployments •Minimize customer impact – customers should have no indication that anything has changed •Maximize engineers' weekends – avoid burnout •Reduce dependencies of rollouts – everything goes out together, 50+ services, 1000+ VMs
  14. 14. Leveraging AWS •Programmable data centers •Nodes are ephemeral •It should be easier to re-create an environment than to fix it —Think like the cloud
  15. 15. What is blue-green? [diagram: a router sends traffic to either Application v1 (web server + app server) or Application v2 (web server + app server); both share a database, and only one side is live at a time]
  16. 16. What is blue-green? •Full cluster BG – everything goes out together – Indiana Jones: "idol switch" •App-based BG – each app or team controls their own blue-green deployments
  17. 17. The data plane can't blue-green all the things [diagram: the blue and green clusters both share the data plane – Kafka, DynamoDB, Redis, Amazon RDS (pgsql), Amazon Redshift, Amazon Glacier, Amazon S3]
  18. 18. When do we deploy? •Teams deploy end-of-sprint releases together •Hot-fixes/upgrades are performed frequently via rolling-restart deployments •Early on, deployments took an entire day – lack of automation •Deploys today generally take 45 minutes – everyone has run a deployment
  19. 19. Sustaining engineer •Every team member, including QA, has run deployments •Builds confidence, understanding, and redundancy •Ensures documentation is up to date and that everything that can be automated is automated [photo: sustaining-engineer badge-of-honor shirt, awarded after their tour of duty]
  20. 20. Deployment day •Apt repo synchronized and locked down •Data plane migrations applied •“Green” cluster is launched (1000s of machines) •IT tests run •Canary customers •Logging and error checks •Active-active •“Blue” marked as inactive, decommissioned
  21. 21. Keys to success Pro tip: It’s not just flipping load balancers
  22. 22. Keys to success Automate all the things •Junior devs should be able to run your deploy system
  23. 23. Keys to success Instrumentation & Metrics https://github.com/codahale/metrics https://github.com/rcrowley/go-metrics
  24. 24. Keys to success Use a provisioning system •Chef •Puppet •Salt •baked AMIs
  25. 25. Keys to success Live integration / regression test suites Test System Send deterministic input values Verify processed state
  26. 26. Keys to success Canary Customers V1 App V2 App
  27. 27. Keys to success Feature Flags
  28. 28. Keys to success Unified app requirements
  29. 29. Keys to success Deployment History
  30. 30. "Thank God we have blue-green" – every team member
  31. 31. Implementation
  32. 32. How we blue-green
  33. 33. Elevator pitch on Kafka •Distributed commit log •Similar to a message queue •Allows for replaying messages from earlier in the stream in case of failure
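[Editor's note: a minimal sketch of the replay property described above, using the kafka-python client. The library choice, broker address, topic name, and offset are assumptions for illustration; the talk does not show consumer code.]

    from kafka import KafkaConsumer, TopicPartition

    consumer = KafkaConsumer(
        bootstrap_servers="kafka-broker:9092",   # hypothetical broker address
        group_id="processor-1",                  # hypothetical consumer group
        enable_auto_commit=False,
    )

    partition = TopicPartition("events", 0)      # hypothetical topic/partition
    consumer.assign([partition])

    # Rewind to a known-good offset and reprocess everything from there,
    # which is what makes recovery after a failure possible.
    consumer.seek(partition, 1000)
    for message in consumer:
        print(message.value)                     # real code would hand off to processing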
  34. 34. It all starts with a running cluster [diagram: sensors → ELB load balancer → termination servers → content routers → Kafka "active" topics → blue processors → data plane (DynamoDB, Redis, Amazon RDS, Amazon Redshift, Amazon Glacier, Amazon S3)] •Blue is running; normal operation •Content Routers are writing to the "active" topics in Kafka •Blue processors read from the "active" topics
  35. 35. Main management page for blue-green
  36. 36. Launching new cluster [diagram: green termination servers, content routers, and processors come up alongside blue, sharing Kafka and the data plane] •Green cluster is launched •Termination servers are kept out of the ELB load balancer by failing health checks •Content Routers write to the "active" topics •Processors in green read from the "inactive" topics
  37. 37. Sizing the new cluster
  38. 38. Getting the size right •Sizing of our autoscale groups is determined programmatically – admin page allows for setting min / max – script determines appropriate desired-capacity based on running cluster •Launching is then as simple as updating the autoscale groups to the new sizes

    from subprocess import Popen, PIPE

    def current_counts(region='us-east-1'):
        proc = Popen(
            ["as-describe-auto-scaling-groups",
             "--region", region,
             "--max-records=600"],
            stdout=PIPE, stderr=PIPE)
        out, err = proc.communicate()
        if err:
            raise Exception(err)
        counts = {}
        for line in out.splitlines():
            if "AUTO-SCALING-GROUP" not in line:
                continue
            parts = line.split()
            group = parts[1]
            current = parts[-2]
            counts[group] = int(current)
        return counts
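[Editor's note: a hypothetical companion to the slide's snippet, pushing the computed sizes back out with the same legacy Auto Scaling CLI. The command name, its flags, and the blue/green group-naming convention are assumptions, not taken from the talk.]

    from subprocess import Popen, PIPE

    def set_desired_capacity(group, desired, region='us-east-1'):
        # Assumed legacy CLI command; errors on stderr are surfaced as exceptions.
        proc = Popen(
            ["as-set-desired-capacity", group,
             "--desired-capacity", str(desired),
             "--region", region],
            stdout=PIPE, stderr=PIPE)
        out, err = proc.communicate()
        if err:
            raise Exception(err)

    # Example: size each green group to match its running blue counterpart.
    for group, current in current_counts().items():
        set_desired_capacity(group.replace("blue", "green"), current)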
  39. 39. Tuning size before we launch
  40. 40. Bootstrapping
  41. 41. User data and Chef get things rolling •Inside out Chef bootstrapping –Didn’t feel comfortable running `wget … | bash` •Custom version of Chef installer –Version of Chef –Where to find the Chef servers –Which role to run –Which environment (dev, integ, blue, green)
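[Editor's note: illustrative only – rendering the kind of user data described above. The bootstrap script path, parameter names, and defaults are hypothetical; the talk does not show the actual installer.]

    USER_DATA_TEMPLATE = """#!/bin/bash
    /opt/bootstrap/install_chef.sh \\
      --chef-version {chef_version} \\
      --chef-server {chef_server} \\
      --role {role} \\
      --environment {environment}
    """

    def build_user_data(role, environment, chef_version="11.16.4",
                        chef_server="https://chef.internal.example"):
        # The installer script is assumed to be baked into the half-baked AMI.
        return USER_DATA_TEMPLATE.format(
            chef_version=chef_version,
            chef_server=chef_server,
            role=role,
            environment=environment)

    print(build_user_data(role="processor", environment="green"))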
  42. 42. Testing the new stuff [diagram: integration tests connect directly to a green termination server; canaried traffic flows through green's content routers and processors via the "inactive" topics while blue keeps serving the "active" topics] •Test customer(s) are canaried •Integration test suite is run by connecting to a termination server directly •Tests pass; then we canary real customers
  43. 43. Canary customers
  44. 44. Canary customers •Canary information is stored in ZooKeeper •Fortunately we dogfood our own tech •This affords us the ability to use ourselves as canaries for new code •The inactive processing cluster is set to read from the .inactive topics – the standard Kafka topics with .inactive appended •The ingestion layer has a watcher on that znode and routes any canaried customer to the .inactive topics •Ex: regular traffic goes to foo.bar, canary traffic goes to foo.bar.inactive •When we are ready to test real traffic, we mark several customers as canaries and start the monitoring process to detect any issues
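[Editor's note: a minimal sketch of the routing described above. Assumptions: the kazoo ZooKeeper client, a znode holding a JSON list of canaried customer IDs, and a hypothetical znode path and ensemble address. Only the ".inactive" topic-name convention comes from the slide.]

    import json
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk1:2181,zk2:2181")   # hypothetical ensemble
    zk.start()

    canary_customers = set()

    @zk.DataWatch("/deploy/canary_customers")     # hypothetical znode path
    def update_canaries(data, stat):
        # Re-read the canary set whenever the znode changes.
        global canary_customers
        canary_customers = set(json.loads(data)) if data else set()

    def topic_for(customer_id, base_topic="foo.bar"):
        # Canaried customers are routed to the inactive cluster's topic.
        if customer_id in canary_customers:
            return base_topic + ".inactive"
        return base_topic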
  45. 45. Canary customers [diagram: sensors → ELB load balancer → event ingestor → Kafka; regular traffic (Customer 123) goes to the active topic and the blue processors, canary traffic (Customer 456) goes to the inactive topic and the green processors]
  46. 46. Let’s canary some customers
  47. 47. That was easy
  48. 48. Testing
  49. 49. IT tests run •Integration tests are run –~3000 tests in total –Test customer must be “canaried” •If any tests fail, we triage and determine if it is still possible to move forward •Testing is only done when we are passing 100%—no exceptions!
  50. 50. Sean is mad - we have work to do
  51. 51. Sean is happy - so we are all happy
  52. 52. Trust, but verify! [diagram: blue and green clusters both running against Kafka and the shared data plane] •Monitor green services •Verify health of the cluster by inspecting graphical data and log outputs •Rerun tests with load
  53. 53. Monitoring
  54. 54. Logging and error checking •Every server forwards its relevant logs to Splunk •Several dashboards have been set up with common things to watch for •Raw logs are streamed in near real-time and we watch specifically for log-level ERROR •This is one of our most important steps, as it gives us the most insight into the health of the system as a whole
  55. 55. Logging / Error Checking
  56. 56. Moving customers over [diagram: both blue and green termination servers are now behind the ELB load balancer; both sets of content routers and processors work against the "active" topics] •Flip all customers back away from canary •Activate green cluster •Event processors and consuming services in blue and green now write to and consume the "active" topics •We are in a state of active-active for a few minutes
  57. 57. Active-active [diagram: ingestion feeding Kafka, with blue and green processor clusters consuming] Each node in the data processing layer has a watcher on a particular znode which tells the environment whether it is active (use the standard Kafka topics) or inactive (append .inactive to the topics)
  58. 58. Active-active [diagram: "Green, switch to active!" – green processors move to the active topic while blue still consumes it] When we are ready to make the switch, we start by making the new cluster active and enter an active-active state where both processing clusters are doing work. This is where it is paramount that code is forward compatible, since two different code bases will be doing work simultaneously
  59. 59. Active-active [diagram: blue and green both processing the active topics] However, blue and green are fully partitioned and there is no intercommunication between the clusters. This allows for things like changes in serialization for inter-service communication.
  60. 60. Flipping the switch [diagram: blue termination servers drop out of the ELB load balancer; blue processors drain the "inactive" topic while green serves the "active" topics] •We deactivate Blue, which forces Termination Servers in Blue to fail health checks, and all Blue sensors disconnect •Blue processors switch to read from the "inactive" topic •Once all consumers of the "inactive" topic have caught up to the head of the stream, Blue can be decommissioned (see the lag-check sketch below)
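[Editor's note: a sketch of the "caught up to the head of the stream" check referenced above, using kafka-python. The library choice, broker address, group name, and topic name are assumptions; the talk does not show how they perform this check.]

    from kafka import KafkaConsumer, TopicPartition

    def blue_caught_up(topic="foo.bar.inactive", group_id="blue-processors"):
        consumer = KafkaConsumer(bootstrap_servers="kafka-broker:9092",
                                 group_id=group_id,
                                 enable_auto_commit=False)
        partitions = [TopicPartition(topic, p)
                      for p in consumer.partitions_for_topic(topic)]
        end_offsets = consumer.end_offsets(partitions)
        for tp in partitions:
            # Decommissioning is safe only once every partition's committed
            # offset has reached the end of the log.
            committed = consumer.committed(tp) or 0
            if committed < end_offsets[tp]:
                return False
        return True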
  61. 61. Out with the old… [diagram: only the green cluster remains, serving the "active" topics against the data plane] •Green is now the active cluster •If we need to roll back code, we have a snapshot of the repository in Amazon S3 •We haven't had to roll back code… yet
  62. 62. Easing the pain
  63. 63. Bootstrapping faster
  64. 64. Half-baked AMIs We use a process to create "half-baked" AMIs, which speeds up deployments •JVM (for our Scala code base) •Common tools and configurations •Latest updates to make sure patches are up to date •Build plan is run twice daily [diagram: a half-baked AMI stored in Amazon S3 feeding an Auto Scaling group of green servers]
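[Editor's note: a hedged sketch of baking an AMI from an already-configured instance with boto, era-appropriate for 2014. The talk does not say which tool their build plan uses; the instance ID, naming scheme, and region are placeholders.]

    import time
    import boto.ec2

    conn = boto.ec2.connect_to_region("us-east-1")

    def bake_half_baked_ami(instance_id, version):
        # Snapshot an instance that already has the JVM, common tools, and
        # latest patches installed, so new nodes boot with less to do.
        name = "half-baked-{}-{}".format(version, int(time.time()))
        return conn.create_image(
            instance_id, name,
            description="JVM + common tools + latest patches",
            no_reboot=True)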
  65. 65. Getting code ready
  66. 66. How code graduates - Development: commit on main → development apt repo → auto deploy changed roles → development cluster
  67. 67. How code graduates - Production: create release-X.X.X or hotfix-X.X.X branches → integration apt repo (sync specified packages for integ) → integration cluster → production apt repo (same exact binary) → new production cluster
  68. 68. Choosing what goes out
  69. 69. Viewing debian details
  70. 70. Integration is synced
  71. 71. Integration is synced
  72. 72. Production is synced from Integ
  73. 73. Updating the data plane
  74. 74. Data plane migrations •Migrations applied to the database are forward only •We have past experience with two-way migrations, but the costs outweigh the benefits •Code must be forward compatible in case rollbacks are necessary •Database schemas are only modified via migrations, even in development and integration environments •We use an in-house migration service (based on Flyway) to parallelize the process
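[Editor's note: not their in-house service – a minimal illustration of forward-only migrations applied in version order against a Postgres data plane, using psycopg2. The file-naming convention (V001__create_events.sql) follows Flyway's style; the connection string, directory, and version table are assumptions.]

    import glob
    import psycopg2

    def apply_migrations(dsn, migration_dir="migrations"):
        conn = psycopg2.connect(dsn)
        cur = conn.cursor()
        cur.execute("""CREATE TABLE IF NOT EXISTS schema_version
                       (version text PRIMARY KEY)""")
        cur.execute("SELECT version FROM schema_version")
        applied = {row[0] for row in cur.fetchall()}

        for path in sorted(glob.glob(migration_dir + "/V*.sql")):
            version = path.split("/")[-1].split("__")[0]
            if version in applied:
                continue              # forward only: never re-run or undo
            cur.execute(open(path).read())
            cur.execute("INSERT INTO schema_version VALUES (%s)", (version,))
            conn.commit()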
  75. 75. Final Thoughts •Blue-green deployments can be done in many ways •Our requirement of never losing customer data made this the best solution for us •The automation and tooling around our deployment system were built over many months and were a lot of work (built by 2 people – hi, Dennis!) •But it is completely worth it, knowing we have a very reliable, fault-tolerant system
  76. 76. Thank you
  77. 77. http://bit.ly/awsevals Jim: @jimplush Sean: @schleprachaun
