This document discusses approaches for managing systems undergoing continuous changes, such as those from continuous deployment in cloud environments. It proposes incorporating knowledge about sporadic operations and external events into system management. For sporadic operations, it describes Process-Oriented Dependability (POD) for error detection and diagnosis during operations like rolling upgrades. It also discusses using process context for alert management and availability analysis of sporadic operations. For external events, it discusses event-aware workload prediction. The goal is to better support operations personnel through performance engineering techniques that account for changes and uncertainty in cloud systems.
Dependable Operation - Performance Management and Capacity Planning Under Continuous Changes
1. NICTA Copyright 2012 From imagination to impact
Dependable Operation
Performance Management and
Capacity Planning Under
Continuous Changes
April, 2014
Dr. Liming Zhu, Dr. Ingo Weber
NICTA/UNSW
http://slideshare.net/limingzhu
2. NICTA (National ICT Australia)
• Australia's National Centre of Excellence in
Information and Communication Technology
• Five Research Labs:
– ATP: Australian Technology Park, Sydney
– NRL: UNSW, Sydney
– CRL: ANU, Canberra
– VRL: Uni. Melbourne
– QRL: Uni. Queensland and QUT
• 700 staff including 270 PhD students
• Budget: ~$90M/yr from Fed/State Gov and
industry
• ~600 research papers/year, ~150 patents total
3. NICTA: Research and Outcomes
[Figure: NICTA research and outcomes. Labels: Networks; Optimisation; Machine Learning; Computer Vision; Broadband and the Digital Economy; Infrastructure, Transport and Logistics; Security and Environment; Engineering and Technology Development; University Partners; Industry and Government Partners; Research Excellence; Wealth Creation.]
4. Software Systems Research Group (SSRG)
• Vision: Cost Effective Dependable Systems
• Two Major Activities
– Trustworthy Systems – single systems
– Dependable Cloud Computing – distributed systems
• Research history related to capacity planning
– Reve8tor/MDABench: capacity planning prototype
– Spin-out: http://www.performance-assurance.com.au/
– SPEC (spec.org) research group member
• Cloud (elasticity) benchmarking
– Keynote at ICPE 2013: “Supporting Operations Personnel
Through Performance Engineering” by Len Bass
5. New Challenge: Continuous Changes
• Significantly shorter release cycles
– Continuous delivery/deployment: from months at
scheduled downtime to hours at all times
• Etsy.com: 25 full deployments per day at 10 commits per deploy
• Resource sharing
– Multiple sporadic operations at all times
– scaling in/out, snapshot, migration, reconfiguration,
rolling upgrade, cron-jobs, backup, recovery…
• Cloud uncertainty
– Limited visibility and indirect control
Demands continuous capacity planning and
performance management
6. Sporadic Operation Example: Rolling Upgrade
[Flowchart: Start → Update Auto-Scaling Group (ASG) → Sort Instances → Remove & Deregister Old Instances from ELB → Terminate Old Instances → Wait for ASG to Start New Instances → Register New Instances with ELB → Stop]
- 100 servers in the cloud running version 1 software
- Upgrade 10 servers at a time to version 2 software
- No downtime or redundancy cost
- May take a long time to complete, with errors during the operation and interference from other concurrent operations
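The batching described above can be sketched in a few lines of Python. This is only an illustrative simulation, not Asgard's implementation: the function name `rolling_upgrade` and the trace format are assumptions, and it assumes each batch of old instances is removed before its replacements are registered.

```python
from typing import List, Tuple

def rolling_upgrade(fleet_size: int = 100, batch_size: int = 10) -> List[Tuple[str, int, int]]:
    """Simulate a rolling upgrade; return (event, old_count, new_count) after each step."""
    old, new = fleet_size, 0
    trace = []
    while old > 0:
        batch = min(batch_size, old)
        old -= batch   # remove & deregister old instances from ELB, then terminate them
        trace.append(("removed", old, new))
        new += batch   # ASG starts replacements, which are registered with ELB
        trace.append(("registered", old, new))
    return trace

trace = rolling_upgrade(fleet_size=100, batch_size=10)
lowest = min(old + new for _, old, new in trace)
print(f"{len(trace) // 2} batches, lowest in-service capacity: {lowest}")
```

Under these assumptions the service never goes down, but in-service capacity dips by one batch during each step, which is where interference from other operations (for example a concurrent scale-in) can hurt.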
7. System Monitoring During Rolling Upgrade
8. Our Approach
• Incorporating change-related knowledge into
system management
– Sporadic operation knowledge
• Process-Oriented Dependability (POD): error detection
and diagnosis under continuous change
• Alerting management using process context
• Availability analysis for sporadic operations
– External event knowledge
• Event-aware workload prediction
9. Process-Oriented Dependability (POD)
• Context
– Large-scale web/enterprise operation in Cloud
– Distributed data analytics in Cloud (Hadoop/Spark)
• Goal: detect, diagnose and react to errors
occurring during sporadic cloud operations
– Scope: “sporadic operations” (not normal operation)
• deployment, reconfiguration, (rolling) upgrade, rollback
• DevOps related: continuous integration/deploy/delivery
10. Operation as Process
• Offline: treat an operation as a process
– Process discovered automatically from logs/scripts
• Clustering of log lines and process mining
– Expected step outcomes specified as assertions
• Online: use process context
– Process context: process/instance/step ids, expected states
– Errors are detected by examining logs and monitoring data
• Assertion evaluation via monitoring facilities or directly
• Compliance checking against expected processes using logs
– Detected errors are further diagnosed for (root) causes
• Examining a fault tree to locate potential root causes
• Performing more diagnostic tests and on-demand assertions
X. Xu, L. Zhu, et al. “POD-Diagnosis: Error Diagnosis of Sporadic Operations on Cloud Applications,” 44th Annual
IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2014.
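The offline step (clustering log lines, then mining a process) can be illustrated with a deliberately simple sketch. It is not the technique used in POD: it clusters lines by masking variable parts with regular expressions and mines only a directly-follows relation. The log line wording and the instance id format are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

def to_activity(line: str) -> str:
    """Cluster a log line into an activity by masking its variable parts."""
    line = re.sub(r"\bi-[0-9a-f]+\b", "<instance>", line)  # hypothetical instance id format
    line = re.sub(r"\d+", "<n>", line)                      # any remaining numbers
    return line.strip()

def directly_follows(traces: List[List[str]]) -> Dict[str, Set[str]]:
    """Mine a directly-follows relation: activity -> activities seen right after it."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for trace in traces:
        activities = [to_activity(line) for line in trace]
        for current, following in zip(activities, activities[1:]):
            graph[current].add(following)
    return dict(graph)

# Two hypothetical runs of the same operation, differing only in instance ids.
runs = [
    ["Remove instance i-0a1b from ELB", "Terminate instance i-0a1b",
     "Waiting for new instance", "New instance i-9f3e in service"],
    ["Remove instance i-77c2 from ELB", "Terminate instance i-77c2",
     "Waiting for new instance", "New instance i-5d10 in service"],
]
model = directly_follows(runs)
```

Both runs collapse to the same four activities, so the mined relation is a single chain that can later serve as the expected process.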
11. Example: Rolling Upgrade Using Asgard
[Figure: offline and online use of Asgard log data. Labels: Operator; Controls; Asgard; Generates; Log data; Process Mining; Outputs; Discovered Model (Check AZs, Create Snapshot, Create instance from snapshot, Create AMI from instance, Evaluate AMI); Read by; Service; Offline; Online.]
12. POD-Detection: Error Detection
Error Detection Service has two
methods for detecting errors:
• Assertion Checking
• Conformance Checking
13. Assertion Checking: how it works
Log line:
Assertions:
14. Assertion Checking: how it works
Log line:
• Remove ...
Assertions:
• i has been de-registered
from ELB
• i has been removed from
ASG
• there is 1 less instance of v1
15. Assertion Checking: how it works
Log line:
• Remove ...
• Terminate ...
Assertions:
• i successfully terminated
16. Assertion Checking: how it works
Log line:
• Remove ...
• Terminate ...
• Wait ...
Assertions:
• Next log line should appear
within 17m35s (95th percentile)
17. Assertion Checking: how it works
Log line:
• Remove ...
• Terminate ...
• Wait ...
• New instance ...
Assertions:
• i' successfully launched
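The step-by-step build-up above amounts to a lookup from the latest log line to a set of checks on the expected state. A minimal sketch, assuming the cloud state is available as a plain dictionary (the keys `elb`, `asg`, `v1_count`, `v1_before`, `terminated` and `running` are hypothetical stand-ins for real monitoring queries):

```python
from datetime import timedelta
from typing import Dict, List

WAIT_TIMEOUT = timedelta(minutes=17, seconds=35)  # 95th percentile from the slide

def assertions_for(step: str, instance: str, state: dict) -> Dict[str, bool]:
    """Evaluate the assertions attached to a process step against a state snapshot."""
    if step == "remove":
        return {
            "i has been de-registered from ELB": instance not in state["elb"],
            "i has been removed from ASG": instance not in state["asg"],
            "there is 1 less instance of v1": state["v1_count"] == state["v1_before"] - 1,
        }
    if step == "terminate":
        return {"i successfully terminated": instance in state["terminated"]}
    if step == "new_instance":
        return {"i' successfully launched": instance in state["running"]}
    raise ValueError(f"no assertions defined for step {step!r}")

def failed(results: Dict[str, bool]) -> List[str]:
    """Names of the assertions that did not hold."""
    return [name for name, ok in results.items() if not ok]

def wait_timed_out(elapsed: timedelta) -> bool:
    """Timing assertion for the wait step: the next log line is overdue."""
    return elapsed > WAIT_TIMEOUT

# Hypothetical snapshot: i-1 left the ELB but is still listed in the ASG.
state = {"elb": {"i-2"}, "asg": {"i-1", "i-2"}, "v1_before": 100, "v1_count": 99,
         "terminated": set(), "running": {"i-2"}}
errors = failed(assertions_for("remove", "i-1", state))
```

Here `errors` contains only the ASG assertion, which gives diagnosis a concrete starting point.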
18. Conformance Checking: how it works
Log lines:
19. Conformance Checking: how it works
Log lines:
• Remove ...
20. Conformance Checking: how it works
Log lines:
• Remove ...
• Terminate ...
21. Conformance Checking: how it works
Log lines:
• Remove ...
• Terminate ...
• Wait ...
22. Conformance Checking: how it works
Log lines:
• Remove ...
• Terminate ...
• Wait ...
• Terminate ...???
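The unexpected second "Terminate" above is exactly what conformance checking flags. As a rough sketch, the expected process can be encoded as a table of allowed transitions and the log replayed against it. The step names and the `ALLOWED` table are assumptions that follow the log-line order shown on the slides, not the actual discovered model.

```python
from typing import List, Optional, Tuple

# Hypothetical expected process for one upgrade loop: step -> allowed next steps.
ALLOWED = {
    "start": {"remove"},
    "remove": {"terminate"},
    "terminate": {"wait"},
    "wait": {"new_instance"},
    "new_instance": {"remove"},  # loop back for the next instance
}

def first_violation(events: List[str]) -> Optional[Tuple[int, str, str]]:
    """Replay events against ALLOWED; return (index, previous, unexpected) or None."""
    previous = "start"
    for index, event in enumerate(events):
        if event not in ALLOWED.get(previous, set()):
            return index, previous, event
        previous = event
    return None

violation = first_violation(["remove", "terminate", "wait", "terminate"])
```

`violation` is `(3, "wait", "terminate")`: the fourth log line does not fit the model, even though no assertion was written for this case in advance.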
23. POD-Diagnosis: how it works
• Fault trees are built as a
knowledge base
• On-demand diagnosis tests
to locate the (root) causes
• Process context is used for fault tree (FT)
pruning
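One way to picture fault-tree pruning with process context is sketched below. It assumes each fault-tree node records the process steps in which it can occur and, optionally, an on-demand diagnostic test. The node names, the `steps` field and the lambda tests are hypothetical, not taken from POD-Diagnosis.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set

@dataclass
class FaultNode:
    name: str
    steps: Set[str]                              # process steps where this fault can occur
    test: Optional[Callable[[], bool]] = None    # on-demand diagnostic test, True = confirmed
    children: List["FaultNode"] = field(default_factory=list)

def diagnose(node: FaultNode, step: str, tests_run: List[str]) -> List[str]:
    """Return confirmed leaf causes, skipping subtrees irrelevant to the current step."""
    if step not in node.steps:
        return []                                # pruned using process context
    if node.test is not None:
        tests_run.append(node.name)
        if not node.test():
            return []                            # test ruled this branch out
    if not node.children:
        return [node.name]
    causes: List[str] = []
    for child in node.children:
        causes.extend(diagnose(child, step, tests_run))
    return causes

# Hypothetical tree; the lambdas stand in for real diagnostic tests.
tree = FaultNode("upgrade error", {"remove", "wait"}, children=[
    FaultNode("AMI unavailable", {"wait"}, test=lambda: True),
    FaultNode("key pair missing", {"wait"}, test=lambda: False),
    FaultNode("ELB deregistration failed", {"remove"}, test=lambda: True),
])
tests_run: List[str] = []
causes = diagnose(tree, "wait", tests_run)
```

Because the error was detected in the wait step, the ELB branch is never tested: only two of the three diagnostic tests run, which is the point of pruning.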
24. Evaluation: POD-Detection/Diagnosis
• Experiments
– Rolling upgrade of 100+ node cluster in AWS
• Fault injection + confounding processes: random kill, scaling-in…
• Detected errors
– Assertion checking: known errors and global errors
• Examples: key management, launch configuration, images
– Compliance checking: unknown errors
• skipping activities or undone activities
• Timing and precision
– Compared with Asgard/monitoring internal mechanisms
• Detected more errors earlier
– Diagnosis: limited to known causes in FT
• 95th percentile time: less than 4 s; accuracy ranges from 80% to 100%
25. Evaluation: POD-Detection/Diagnosis
26. Our Approach
• Incorporating change-related knowledge into
system management
– sporadic operation knowledge
• Process-Oriented Dependability: Error detection and
diagnosis under continuous change
• Alerting management using process context
• Availability analysis for sporadic operations
– External event knowledge
• Event-aware workload prediction
27. Alerting Management using Process Context
• Do not turn off alerts during sporadic operation
• Dynamically suppressing and annotating alerts
using sporadic operation knowledge
– CPU sensitive?
– Network sensitive?
– I/O sensitive?
– Health checking sensitive?
• Benefits
– Reduce false positives of alerts
– Add context to system monitoring data for later
capacity planning and performance tuning
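The suppress-and-annotate idea can be sketched as follows. The `SENSITIVITY` table, which records the metrics each sporadic operation is expected to disturb, and the operation and metric names are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sensitivity profiles: metrics each sporadic operation is expected to disturb.
SENSITIVITY = {
    "rolling_upgrade": {"cpu", "health_check"},
    "backup": {"io", "network"},
}

@dataclass
class Alert:
    metric: str
    message: str
    suppressed: bool = False
    context: Optional[str] = None

def apply_process_context(alert: Alert, active_operation: Optional[str]) -> Alert:
    """Annotate every alert raised during an operation; suppress only the expected ones."""
    if active_operation is None:
        return alert                         # normal operation: leave the alert untouched
    alert.context = f"during {active_operation}"
    if alert.metric in SENSITIVITY.get(active_operation, set()):
        alert.suppressed = True              # expected side effect of the operation
    return alert

expected = apply_process_context(Alert("health_check", "instance unhealthy"), "rolling_upgrade")
unexpected = apply_process_context(Alert("io", "disk latency high"), "rolling_upgrade")
```

Alerting stays on throughout: `expected` is suppressed, while `unexpected` still fires but carries the operation as context for later capacity planning and tuning.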
28. Availability Analysis for Sporadic Operation
• Sporadic Operation's Impact on Availability
– Using Stochastic Reward Network (SRN)
– Maintenance/Backup/Recovery operation
• Architecture has effect as well
Qinghua Lu, et al. “Incorporating uncertainty into in-cloud application deployment decisions for
availability,” IEEE 6th International Conference on Cloud Computing, June 2013.
30. Availability Estimation for Different Deployment and Recovery Approaches
31. Event-Aware Workload Prediction
[Figure: workload prediction pipeline. Labels: Upcoming Event; Repository; Event Workload Model; Predict Workload; Workload Prediction.]
Matthew Sladescu, et al. “Event aware workload prediction: A study using auction
events,” International Conference on Web Information System Engineering (WISE), 2012.
32. Predicting Workload
[Figure: per-auction bid curves added together (+, +, =) to give the predicted workload. Axes: Time (min) vs. Bids/min. Marker: Time to Predict.]
33. Summary
• Systems are undergoing continuous changes
– Continuous deployment + Cloud uncertainty/visibility
• Use change-related knowledge in system management
– sporadic operation knowledge
• POD: Error detection and diagnosis under continuous change
• Alerting management using process context
• Availability analysis for sporadic operations
– External event knowledge
• Event-aware workload prediction
• We need industry help and collaboration
– Logs, trials, case study and feedback
Book: http://www.ssrg.nicta.com.au/projects/devops_book/
Contact: {firstname.lastname@nicta.com.au}
Editor's Notes
From a high-level point of view, we take the time for which we want to predict the workload and identify all auctions that are active at that time. From the workload model associated with each auction, we work out what percentage of bids is expected at that time; then, from past history, we work out how many bids are expected for each auction. Summing these expected bids gives the prediction.
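The summation described in this note can be sketched as below. It assumes each auction carries a total expected bid count (from past history) and a workload model that returns the fraction of those bids expected per minute at a given point in the auction. The `Auction` fields and the two example models are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Auction:
    start: float                      # minutes
    end: float                        # minutes
    expected_total_bids: float        # estimated from past history
    model: Callable[[float], float]   # progress in [0, 1) -> fraction of bids per minute

def predict_bids_per_min(auctions: List[Auction], t: float) -> float:
    """Sum the expected bid rate of every auction active at time t."""
    total = 0.0
    for auction in auctions:
        if auction.start <= t < auction.end:
            progress = (t - auction.start) / (auction.end - auction.start)
            total += auction.expected_total_bids * auction.model(progress)
    return total

def steady(progress: float) -> float:
    """Hypothetical model: 1% of the bids arrive every minute."""
    return 0.01

def late_rush(progress: float) -> float:
    """Hypothetical model: bids arrive only in the second half, 2% per minute."""
    return 0.02 if progress >= 0.5 else 0.0

auctions = [Auction(0, 100, 200, steady), Auction(0, 100, 100, late_rush)]
early = predict_bids_per_min(auctions, 10)   # only the steady auction contributes
late = predict_bids_per_min(auctions, 60)    # both auctions contribute
```

Adding an upcoming auction to the list changes the prediction for its active period without retraining anything, which is what makes the approach event-aware.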