Dependable Operation - Performance Management and Capacity Planning Under Continuous Changes


Talk at http://www.cmga.org.au/ Meet up
Modern large-scale applications experience sporadic changes due to operational activities such as upgrades, redeployment, on-demand scaling and interference from other simultaneous operations. This poses new challenges in system monitoring, capacity planning, performance management, error detection and diagnosis. For example, traditional anomaly-detection-based techniques are less effective during the “sporadic” operation period, because a wide range of legitimate changes confound the situation and make it difficult to establish a performance baseline for “normal” operation. The increasing frequency of these sporadic operations (e.g. due to continuous deployment) is exacerbating the problem. In this talk, we will introduce a number of ongoing research activities at NICTA addressing these issues. For example, we propose Process-Oriented Dependability (POD), an approach that explicitly models these sporadic operations as processes and uses the process context to filter logs, traverse fault trees and conduct adaptive monitoring.

  • At a high level, we take the time for which we want to predict workload and identify all auctions that are active at that time. From the workload model associated with each auction, we work out what percentage of bids is expected at that time, and then, based on past history, how many bids are expected for each auction. The sum of these expected bids gives the prediction.
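The prediction described in this note can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Auction` record, the `share_per_min` model signature, the `uniform_share` placeholder model and the sample numbers are all hypothetical, and the sketch reads "expected bids" as the model's per-minute share multiplied by a historical total per auction.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Auction:
    start_min: float            # auction opening time, in minutes
    end_min: float              # auction closing time, in minutes
    expected_total_bids: float  # assumed to be estimated from past history
    # Workload model (hypothetical signature): given the auction duration and
    # the minutes elapsed since opening, return the fraction of the auction's
    # total bids expected in that minute.
    share_per_min: Callable[[float, float], float]


def uniform_share(duration_min: float, elapsed_min: float) -> float:
    """Placeholder model: bids are spread evenly over the auction."""
    return 1.0 / duration_min


def predict_bids_per_min(auctions: List[Auction], t: float) -> float:
    """Sum the expected bids/min at time t over all auctions active at t."""
    total = 0.0
    for a in auctions:
        if a.start_min <= t < a.end_min:  # only active auctions contribute
            share = a.share_per_min(a.end_min - a.start_min, t - a.start_min)
            total += share * a.expected_total_bids
    return total


auctions = [
    Auction(0, 100, 200, uniform_share),
    Auction(50, 150, 300, uniform_share),
]
print(predict_bids_per_min(auctions, 60))  # both auctions are active at t=60
```

A more realistic model would replace `uniform_share` with a per-event curve learned from past auctions; the summation step stays the same.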

    1. Dependable Operation - Performance Management and Capacity Planning Under Continuous Changes
       April 2014
       Dr. Liming Zhu, Dr. Ingo Weber, NICTA/UNSW
       http://slideshare.net/limingzhu
       NICTA Copyright 2012. From imagination to impact.
    2. NICTA (National ICT Australia)
       • Australia's National Centre of Excellence in Information and Communication Technology
       • Five Research Labs:
         – ATP: Australian Technology Park, Sydney
         – NRL: UNSW, Sydney
         – CRL: ANU, Canberra
         – VRL: Uni. Melbourne
         – QRL: Uni. Queensland and QUT
       • 700 staff including 270 PhD students
       • Budget: ~$90M/yr from Fed/State Gov and industry
       • ~600 research papers/year, ~150 patents total
    3. NICTA: Research and Outcomes
       [Diagram labels: Networks; Optimisation; Machine Learning; Computer Vision; Broadband and the Digital Economy; Infrastructure, Transport and Logistics; Security and Environment; University Partners; Industry and Government Partners; Research Excellence; Wealth Creation; Engineering and Technology Development]
    4. Software Systems Research Group (SSRG)
       • Vision: Cost Effective Dependable Systems
       • Two Major Activities
         – Trustworthy Systems: single systems
         – Dependable Cloud Computing: distributed systems
       • Research history related to capacity planning
         – Reve8tor/MDABench: capacity planning prototype
         – Spin-out: http://www.performance-assurance.com.au/
         – SPEC (spec.org) research group member
           • Cloud (elasticity) benchmarking
         – Keynote at ICPE 2013: “Supporting Operations Personnel Through Performance Engineering” by Len Bass
    5. New Challenge: Continuous Changes
       • Significantly shorter release cycles
         – Continuous delivery/deployment: from months at scheduled downtime to hours at all times
           • Etsy.com: 25 full deployments per day at 10 commits per deploy
       • Resource sharing
         – Multiple sporadic operations at all times
         – Scaling in/out, snapshot, migration, reconfiguration, rolling upgrade, cron-jobs, backup, recovery…
       • Cloud uncertainty
         – Limited visibility and indirect control
       Demands continuous capacity planning and performance management
    6. Sporadic Operation Example: Rolling Upgrade
       [Process diagram labels: Start; Update Auto-Scaling Group (ASG); Sort Instances; Remove & Deregister Old Instances from ELB; Terminate Old Instances; Wait for ASG to Start New Instances; Register New Instances with ELB; Stop]
       - Have 100 servers in cloud with version 1 software
       - Upgrade 10 servers at a time to version 2 software
       - No downtime or redundancy cost
       - Can take a long time to complete, with errors during the operation and interference from other operations
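The rolling upgrade on this slide can be sketched as an in-memory simulation. This is a hedged illustration only: it makes no AWS or Asgard calls, the step order (remove, terminate, wait, new instance, register) is an assumption taken from the log lines shown on the later assertion-checking slides, and the log wording and the `-v2` naming scheme are made up.

```python
from typing import Dict, List, Set, Tuple


def rolling_upgrade(
    instance_ids: List[str], batch_size: int
) -> Tuple[Dict[str, str], Set[str], List[str], int]:
    """Simulate a v1 -> v2 rolling upgrade entirely in memory.

    Returns (fleet, elb, log, min_in_service).
    """
    fleet = {i: "v1" for i in instance_ids}  # instance id -> software version
    elb = set(instance_ids)                  # instances receiving traffic
    log = ["Update ASG launch configuration to v2"]
    min_in_service = len(elb)

    pending = sorted(instance_ids)           # "Sort Instances"
    while pending:
        batch, pending = pending[:batch_size], pending[batch_size:]
        for old in batch:
            elb.discard(old)
            log.append(f"Remove {old} from ELB")
            del fleet[old]
            log.append(f"Terminate {old}")
        # Capacity is lowest after a batch is removed and before its
        # replacements are registered.
        min_in_service = min(min_in_service, len(elb))
        log.append("Wait for ASG to start new instances")
        for old in batch:
            new = f"{old}-v2"                # hypothetical naming scheme
            fleet[new] = "v2"
            log.append(f"New instance {new}")
            elb.add(new)
            log.append(f"Register {new} with ELB")
    return fleet, elb, log, min_in_service


ids = [f"i-{n:03d}" for n in range(100)]
fleet, elb, log, low = rolling_upgrade(ids, batch_size=10)
print(len(fleet), set(fleet.values()), low)
```

With 100 servers and a batch size of 10, the simulated fleet never has fewer than 90 instances in service, which is the trade-off the slide describes: no extra servers are provisioned, but capacity dips during each batch.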
    7. System Monitoring During Rolling Upgrade
    8. Our Approach
       • Incorporating change-related knowledge into system management
         – Sporadic operation knowledge
           • Process-Oriented Dependability (POD): error detection and diagnosis under continuous change
           • Alerting management using process context
           • Availability analysis for sporadic operations
         – External event knowledge
           • Event-aware workload prediction
    9. Process-Oriented Dependability (POD)
       • Context
         – Large-scale web/enterprise operation in Cloud
         – Distributed data analytics in Cloud (Hadoop/Spark)
       • Goal: detect, diagnose and react to errors occurring during sporadic cloud operations
         – Scope: “sporadic operations” (not normal operation)
           • Deployment, reconfiguration, (rolling) upgrade, rollback
           • DevOps related: continuous integration/deploy/delivery
    10. Operation as Process
        • Offline: treat an operation as a process
          – Process discovered automatically from logs/scripts
            • Clustering of log lines and process mining
          – Expected step outcomes specified as assertions
        • Online: use process context
          – Process context: process/instance/step ids, expected states
          – Errors are detected by examining logs and monitoring data
            • Assertion evaluation using monitoring facilities or directly
            • Compliance checking against expected processes using logs
          – Detected errors are further diagnosed for (root) causes
            • Examining a fault tree to locate potential root causes
            • Performing more diagnostic tests and on-demand assertions
        X. Xu, L. Zhu, et al. “POD-Diagnosis: Error Diagnosis of Sporadic Operations on Cloud Applications,” 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2014.
    11. Example: Rolling Upgrade Using Asgard
        [Diagram labels: Operator; Asgard; Process Mining Service; Discovered Model; Log data; Read by; Controls; Outputs; Generates; Offline; Online; Check AZs; Create Snapshot; Create instance from snapshot; Create AMI from instance; Evaluate AMI]
    12. POD-Detection: Error Detection
        Error Detection Service has two methods for detecting errors:
        • Assertion Checking
        • Conformance Checking
    13. Assertion Checking: how it works
        Log line:
        Assertions:
    14. Assertion Checking: how it works
        Log line:
        • Remove ...
        Assertions:
        • i has been de-registered from ELB
        • i has been removed from ASG
        • there is 1 fewer instance of v1
    15. Assertion Checking: how it works
        Log line:
        • Remove ...
        • Terminate ...
        Assertions:
        • i successfully terminated
    16. Assertion Checking: how it works
        Log line:
        • Remove ...
        • Terminate ...
        • Wait ...
        Assertions:
        • Next log line should appear within 17m35s (95th percentile)
    17. Assertion Checking: how it works
        Log line:
        • Remove ...
        • Terminate ...
        • Wait ...
        • New instance ...
        Assertions:
        • i' successfully launched
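The assertion checks in the build above can be sketched as plain functions over observed system state. This is a minimal sketch under stated assumptions: the state dictionaries (`elb`, `asg`, `v1_count`) and the function names are hypothetical stand-ins for what a monitoring facility would report, and only the 17m35s limit comes from the slide.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional

# 95th-percentile gap after a "Wait" log line, as quoted on the slide.
WAIT_LIMIT = timedelta(minutes=17, seconds=35)


def check_remove(before: Dict, after: Dict, instance: str) -> List[str]:
    """Assertions triggered by a 'Remove <instance>' log line."""
    failures = []
    if instance in after["elb"]:
        failures.append(f"{instance} is still registered with the ELB")
    if instance in after["asg"]:
        failures.append(f"{instance} is still in the ASG")
    if after["v1_count"] != before["v1_count"] - 1:
        failures.append("v1 instance count did not drop by exactly 1")
    return failures


def check_wait(
    wait_at: datetime, next_at: Optional[datetime], now: datetime
) -> List[str]:
    """Timing assertion triggered by a 'Wait' log line.

    If no follow-up line has been seen yet, the elapsed time up to
    `now` is compared against the limit instead.
    """
    seen = next_at if next_at is not None else now
    if seen - wait_at > WAIT_LIMIT:
        return ["no follow-up log line within 17m35s"]
    return []


before = {"elb": {"i-1", "i-2"}, "asg": {"i-1", "i-2"}, "v1_count": 2}
after = {"elb": {"i-2"}, "asg": {"i-1", "i-2"}, "v1_count": 1}
print(check_remove(before, after, "i-1"))  # i-1 was left in the ASG
```

Returning a list of failure messages, rather than raising on the first one, lets a single log line report every violated assertion at once.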
    18. Conformance Checking: how it works
        Log lines:
    19. Conformance Checking: how it works
        Log lines:
        • Remove ...
    20. Conformance Checking: how it works
        Log lines:
        • Remove ...
        • Terminate ...
    21. Conformance Checking: how it works
        Log lines:
        • Remove ...
        • Terminate ...
        • Wait ...
    22. Conformance Checking: how it works
        Log lines:
        • Remove ...
        • Terminate ...
        • Wait ...
        • Terminate ...???
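The idea behind the build above, replaying log lines against the expected process and flagging the first step that does not fit, can be sketched with a table of allowed transitions. This is a deliberate simplification and an assumption: the slides do not say how the discovered model is represented, and a directly-follows table like `ALLOWED` below is a stand-in for a full process model. The activity names are taken from the log lines on the slides.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical process model: which activity may follow which.
ALLOWED: Dict[str, Set[str]] = {
    "START": {"Remove"},
    "Remove": {"Terminate"},
    "Terminate": {"Wait"},
    "Wait": {"New instance"},
    "New instance": {"Remove"},  # loop back for the next instance
}


def first_deviation(activities: List[str]) -> Optional[Tuple[int, str, str]]:
    """Return (index, previous activity, offending activity) for the first
    log line that the model does not allow, or None if the trace conforms."""
    prev = "START"
    for idx, act in enumerate(activities):
        if act not in ALLOWED.get(prev, set()):
            return idx, prev, act
        prev = act
    return None


trace = ["Remove", "Terminate", "Wait", "Terminate"]
print(first_deviation(trace))  # the second Terminate is unexpected after Wait
```

Because the check runs line by line, a deviation can be reported as soon as the offending log line appears, without waiting for the operation to finish.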
    23. POD-Diagnosis: how it works
        • Fault trees are built as a knowledge base
        • On-demand diagnosis tests to locate the (root) causes
        • Process context used for fault tree (FT) pruning
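Pruning a fault tree with process context can be sketched as follows. This is an illustrative sketch only: the `FaultNode` structure, the fault names and the stubbed diagnostic tests are hypothetical, and the slides do not describe the actual tree representation. The point shown is that branches irrelevant to the current process step are skipped, so their diagnostic tests never run.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set


@dataclass
class FaultNode:
    name: str
    steps: Set[str]                            # process steps where this fault can occur
    test: Optional[Callable[[], bool]] = None  # leaf only: True if the fault is confirmed
    children: List["FaultNode"] = field(default_factory=list)


def diagnose(node: FaultNode, step: str) -> List[str]:
    """Return confirmed leaf causes, visiting only branches relevant to `step`."""
    if step not in node.steps:
        return []  # pruned using process context
    if not node.children:
        return [node.name] if node.test is not None and node.test() else []
    causes: List[str] = []
    for child in node.children:
        causes.extend(diagnose(child, step))
    return causes


ran: List[str] = []  # records which diagnostic tests were actually executed


def make_test(label: str, result: bool) -> Callable[[], bool]:
    """Build a stub diagnostic test that logs its own execution."""
    def run() -> bool:
        ran.append(label)
        return result
    return run


tree = FaultNode(
    "upgrade error",
    {"Remove", "New instance"},
    children=[
        FaultNode("ELB deregistration failed", {"Remove"}, make_test("elb", False)),
        FaultNode("wrong launch configuration", {"New instance"}, make_test("lc", True)),
        FaultNode("image unavailable", {"New instance"}, make_test("ami", False)),
    ],
)
print(diagnose(tree, "New instance"), ran)
```

In this example the ELB test is never executed for an error detected at the "New instance" step, which is where the time saving from pruning would come from.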
    24. Evaluation: POD-Detection/Diagnosis
        • Experiments
          – Rolling upgrade of 100+ node cluster in AWS
            • Fault injection + confounding processes: random kill, scaling-in, ...
        • Detected errors
          – Assertion checking: known errors and global errors
            • Examples: key management, launch configuration, images
          – Compliance checking: unknown errors
            • Skipping activities or undone activities
        • Timing and precision
          – Compared with Asgard/Monitoring internal mechanisms
            • Detected more errors earlier
          – Diagnosis: limited to known causes in FT
            • 95th percentile less than 4s; accuracy ranges from 80% to 100%
    25. Evaluation: POD-Detection/Diagnosis
    26. Our Approach
        • Incorporating change-related knowledge into system management
          – Sporadic operation knowledge
            • Process-Oriented Dependability: error detection and diagnosis under continuous change
            • Alerting management using process context
            • Availability analysis for sporadic operations
          – External event knowledge
            • Event-aware workload prediction
    27. Alerting Management using Process Context
        • Do not turn off alerts during sporadic operation
        • Dynamically suppressing and annotating alerts using sporadic operation knowledge
          – CPU sensitive?
          – Network sensitive?
          – I/O sensitive?
          – Health checking sensitive?
        • Benefits
          – Reduce false positives of alerts
          – Add context to system monitoring data for later capacity planning and performance tuning
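Suppressing and annotating alerts with process context, rather than switching alerting off, can be sketched as below. This is a hedged example: the alert and operation dictionaries, their field names (`host`, `metric`, `hosts`, `expected_effects`) and the sample values are assumptions made for illustration and do not come from the talk.

```python
from typing import Dict, List


def triage(alert: Dict, active_ops: List[Dict]) -> Dict:
    """Annotate an alert with the sporadic operation touching its host.

    The alert is marked as suppressed only when its metric is one the
    operation is expected to disturb; it is never dropped.
    """
    for op in active_ops:
        if alert["host"] in op["hosts"]:
            return {
                **alert,
                "context": op["name"],
                "suppressed": alert["metric"] in op["expected_effects"],
            }
    return {**alert, "context": None, "suppressed": False}


ops = [
    {
        "name": "rolling-upgrade",
        "hosts": {"web-01", "web-02"},
        "expected_effects": {"cpu", "health_check"},  # sensitivities of this operation
    }
]

print(triage({"host": "web-01", "metric": "cpu"}, ops))
print(triage({"host": "web-01", "metric": "disk_io"}, ops))
print(triage({"host": "db-01", "metric": "cpu"}, ops))
```

Keeping the `context` field on every alert from an affected host, suppressed or not, is what makes the monitoring data usable later for capacity planning and tuning.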
    28. Availability Analysis for Sporadic Operation
        • Sporadic Operation's Impact on Availability
          – Using Stochastic Reward Network (SRN)
          – Maintenance/Backup/Recovery operation
        • Architecture has an effect as well
        Qinghua Lu, et al. “Incorporating uncertainty into in-cloud application deployment decisions for availability”, IEEE 6th International Conference on Cloud Computing, June 2013
    30. Availability Estimation for Different Deployment and Recovery Approaches
    31. Event-Aware Workload Prediction
        [Diagram labels: Upcoming Event; Repository; Predict Workload; Workload Prediction; Event Workload Model]
        Matthew Sladescu, et al. “Event aware workload prediction: A study using auction events”, International Conference on Web Information Systems Engineering (WISE), 2012
    32. Predicting Workload
        [Figure: individual curves added together (+ + =) into a combined curve; axes: Time (min), Bids/min; marker: Time to Predict]
    33. Summary
        • System is undergoing continuous changes
          – Continuous deployment + Cloud uncertainty/visibility
        • Use change-related knowledge in system management
          – Sporadic operation knowledge
            • POD: error detection and diagnosis under continuous change
            • Alerting management using process context
            • Availability analysis for sporadic operations
          – External event knowledge
            • Event-aware workload prediction
        • We need industry help and collaboration
          – Logs, trials, case study and feedback
        Book: http://www.ssrg.nicta.com.au/projects/devops_book/
        Contact: {firstname.lastname@nicta.com.au}
