Talk at RELENG 2014
Full paper: http://www.nicta.com.au/pub?doc=7925
The continuous delivery trend is dramatically shortening release cycles from months to hours. Applications with high-frequency releases often rely heavily on automated deployment tools built on cloud infrastructure APIs. We report results from experiments on reliability issues of cloud infrastructure and on the trade-offs between heavily-baked and lightly-baked machine images. Our experiments were based on the Amazon Web Services (AWS) OpsWorks APIs and the configuration management tool Chef. Based on these experiments, we propose error handling practices that can be included in tailor-made continuous deployment facilities.
More related info at our DevOps book http://www.ssrg.nicta.com.au/projects/devops_book/
Challenges in Practicing High Frequency Releases in Cloud Environments
1. NICTA Copyright 2012 From imagination to impact
Challenges in Practicing
High Frequency Releases in
Cloud Environments
Liming Zhu, Donna Xu, Xiwei Xu, An Binh
Tran, Ingo Weber, Len Bass
NICTA/UNSW
http://slideshare.net/limingzhu
2.
NICTA (National ICT Australia)
• Australia’s National Centre of Excellence in
Information and Communication Technology
• Five Research Labs:
– ATP: Australian Technology Park, Sydney
– NRL: UNSW, Sydney
– CRL: ANU, Canberra
– VRL: Uni. Melbourne
– QRL: Uni. Queensland and QUT
• 700 staff including 270 PhD students
• Research Groups
– Software Systems Research Group (SSRG)
• ssrg.nicta.com.au
– Machine Learning, Optimisation, Networks,
Computer Vision
3.
Challenge: High Frequency Releases/Changes
• Significantly shorter release cycles and DevOps
– Continuous delivery/deployment
• from months at scheduled downtime to hours at all times
• Cloud uncertainty during provision/deployment
– Heavy reliance on Cloud APIs; Indirect control
– Other “sporadic” operations: cron jobs/backup/reconfig...
– Our focus: error detection/diagnosis during
continuous “changes”
• Anomaly-detection/monitoring for normal operation not working
• One solution: machine image as build artifacts?
– Heavily-baked vs. lightly-baked? Immutable server?
4.
Heavily-Baked vs. Lightly-Baked
• Heavily-baked approach
+ No server drift; consistent, more reliable?
– Image preparation time for any minor release
– Image sprawl
– Image consistency among teams
• coordination, golden image, image inheritance..
• Lightly-baked approach
+ Highly dynamic, config-as-service, less restarting…
– Less reliable due to runtime dependence on external
services (e.g., package repositories, configuration services)?
– Drifting, outcome validation, race conditions..
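The contrast between the two approaches can be sketched in Python. This is a minimal illustration, not real AWS or Chef code; all function and field names here are hypothetical. The key difference it shows: a heavily-baked image needs no configuration steps at boot, while a lightly-baked image runs a recipe per boot, each one a runtime dependency and a potential failure point.

```python
# Illustrative sketch of the two image strategies (hypothetical names,
# not real AWS/Chef APIs).

def heavily_baked_release(base_image, app_version):
    """Bake a new machine image per release; instances boot ready-to-serve."""
    image = {"base": base_image, "app": app_version, "baked": True}
    # No runtime config dependency: everything is already in the image.
    return {"image": image, "config_steps_at_boot": 0}

def lightly_baked_release(base_image, app_version, recipes):
    """Boot a thin base image, then apply configuration recipes at runtime."""
    instance = {"image": {"base": base_image, "baked": False},
                "config_steps_at_boot": len(recipes)}
    # Each recipe depends on external services (package repo, config service, ...)
    for recipe in recipes:
        instance[recipe] = app_version
    return instance

heavy = heavily_baked_release("ubuntu-12.04", "v2")
light = lightly_baked_release("ubuntu-12.04", "v2",
                              ["install_app", "write_config"])
```

The trade-off in the bullets above then falls out of the counts: zero boot-time steps for `heavy` (but a fresh bake per release) versus one step per recipe for `light` (faster releases, more runtime failure modes).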
5.
Motivating Example: Rolling Upgrade
• Used in large-scale web operations
– Have 100+ servers in cloud with version 1 software
– Upgrade 10 servers at a time to version 2 software
• Can take a long time to complete, with errors arising
during the operation
– Provisioning failure, logical failures, instance failure
– Other interfering operations
• Heavily-baked vs. lightly-baked
– Past experiences: Netflix Asgard with heavily-baked
– AWS OpsWorks:
• DevOps automation + life cycle events + abstraction
• Heavily-baked + built-in recipe vs. lightly-baked + custom recipe
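The rolling upgrade described above can be sketched as a simple batch loop. This is an illustrative sketch only (plain Python, not OpsWorks or Asgard code); `upgrade_one` stands in for the real provision-and-deploy step, and the simulated failure rule is made up for the example.

```python
# Sketch of a rolling upgrade: 100 servers, upgraded in batches of 10,
# collecting failures instead of aborting the whole operation.

def rolling_upgrade(servers, upgrade_one, batch_size=10):
    """Upgrade servers batch by batch; return the servers that failed."""
    failed = []
    for i in range(0, len(servers), batch_size):
        for server in servers[i:i + batch_size]:
            try:
                upgrade_one(server)        # provision + deploy version 2
            except Exception:
                failed.append(server)      # provisioning/logical/instance failure
    return failed

servers = ["server-%03d" % n for n in range(100)]

def upgrade_one(server):
    # Simulate sporadic failures: every server whose id ends in 7 fails.
    if server.endswith("7"):
        raise RuntimeError("provisioning failure on " + server)

failures = rolling_upgrade(servers, upgrade_one)
```

A real controller must then decide what to do with `failures`: retry, replace the instance, or roll back, which is exactly where the tactics on the next slide come in.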
9.
Solutions for Better Reliability/Predictability
• Ad hoc tactics to reduce tails
– Inspired by Jeff Dean’s “Tail at Scale” CACM article
– Retry with alternative options
• stop-restart, replace, deploy without restart
– Fail fast
• Tracking step durations and the 95th percentile to fail fast
– Asynchronous waves for upgrading granularity >1
• Validate intermediary outcomes
– Inside machine:
• Chef Mini-test; test cases in production monitoring
– Outside machine:
• Process-Oriented Dependability (POD)
• Assertion checking and conformance checking
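The "fail fast" tactic can be sketched as follows: keep a history of past step durations and abort a step once it exceeds the 95th percentile, handing control to an alternative action (stop-restart, replace, deploy without restart). This is a hypothetical sketch of the idea, not the actual implementation; the nearest-rank percentile and the threshold rule are assumptions for illustration.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile of a list of durations."""
    ordered = sorted(samples)
    rank = max(0, math.ceil(p / 100.0 * len(ordered)) - 1)
    return ordered[rank]

def run_with_fail_fast(step, history, elapsed):
    """Abort once elapsed time passes the 95th percentile of history."""
    budget = percentile(history, 95)
    if elapsed > budget:
        return "fail-fast"      # hand over to retry: stop-restart, replace, ...
    return "keep-waiting"

# Past upgrade-step durations in seconds; one long-tail outlier (200 s).
history = [30, 31, 32, 33, 34, 35, 36, 30, 31, 32,
           33, 34, 35, 36, 30, 31, 32, 33, 34, 200]
```

With this history the 95th-percentile budget ignores the single outlier, so a step stuck at 120 s is cut off early instead of waiting out the long tail.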
10.
Process-Oriented Dependability (POD)
• Offline: treat operations as processes
– Processes discovered automatically from logs/scripts
• Log line clustering and process mining
– Expected step outcomes specified as assertions
• Online: use process context
– Process context: process/instance/step ids, expected states
– Errors are detected by examining logs and monitoring data
• Assertion evaluation using monitoring facilities or directly
• Compliance checking against expected processes
– Detected errors are further diagnosed for (root) causes
• Examining a fault tree to locate potential root causes
• Performing more diagnostic tests and on-demand assertions
X. Xu, L. Zhu, et al., “POD-Diagnosis: Error Diagnosis of Sporadic Operations on Cloud Applications,” 44th Annual
IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2014.
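The conformance-checking side of POD can be illustrated with a minimal sketch (this is not the actual POD implementation; the step names and return values are hypothetical): observed log events are compared against the expected process model, and the first deviation is reported as the seed for fault-tree diagnosis.

```python
# Minimal sketch of conformance checking against an expected process model.

EXPECTED_PROCESS = ["check_azs", "create_snapshot", "create_instance",
                    "create_ami", "evaluate_ami"]

def conformance_check(observed):
    """Return (step, kind) for the first deviation, or None if conformant."""
    for i, step in enumerate(observed):
        if i >= len(EXPECTED_PROCESS):
            return (step, "unexpected-extra-step")
        if step != EXPECTED_PROCESS[i]:
            return (step, "out-of-order-or-missing")
    if len(observed) < len(EXPECTED_PROCESS):
        # The process stopped early: report the step that never happened.
        return (EXPECTED_PROCESS[len(observed)], "missing-step")
    return None

# A log where 'create_snapshot' never appears:
deviation = conformance_check(["check_azs", "create_instance"])
# → ('create_instance', 'out-of-order-or-missing')
```

The detected deviation identifies the step and process context, which online diagnosis can then use to walk the fault tree and run on-demand diagnostic tests.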
11.
Example: Rolling Upgrade Using Asgard
[Diagram: Offline, Asgard generates log data; a Process Mining Service reads the logs and outputs a discovered process model (Check AZs, Create Snapshot, Create instance from snapshot, Create AMI from instance, Evaluate AMI), which is read by the operator. Online, the discovered model drives error detection.]
Error Detection Service has two
methods for detecting errors:
• Assertion Checking
• Conformance Checking
12.
Summary
• Lightly-baked vs. heavily-baked images for high frequency releases
• Solutions for unreliable processes
– Some tactics to reduce long tails
• fail fast, alternative actions, asynchronous waves…
– Validate intermediary outcomes
• Inside machine: Chef Mini-test; test cases in production monitoring
• Outside machine: Process-Oriented Dependability (POD)
– Assertion checking and conformance checking
• Currently integrating with monitoring and alerting
• We need industry help and collaboration
– Logs, trials, feedback, case studies as book chapters
Book: http://www.ssrg.nicta.com.au/projects/devops_book/
Contact: Liming.Zhu@nicta.com.au