
POD-Diagnosis: Error Detection and Diagnosis of Sporadic Operations on Cloud Applications


My talk at Berkeley AMPLab (http://amplab.cs.berkeley.edu) on dependable cloud operation and big data analytics infrastructure.

Published in: Software, Technology

  1. NICTA Copyright 2012 From imagination to impact
     POD-Diagnosis: Error Detection and Diagnosis of Sporadic Operations on Cloud Applications
     Dr. Liming Zhu, Liming.Zhu@nicta.com.au
     Principal Researcher, NICTA/UNSW
     April 2014, at Berkeley AMPLab
  2. Outline
     • Dependable Cloud Operation
     • Approach: Process-Oriented Dependability (POD)
       – POD-Diagnosis
       – Undo/Recovery Planning using AI Planning
       – Modeling and Analysis using DTMC
     • Connections with AMPLab BDAS
  3. Dependable Cloud Operation: Motivation
     • Sporadic operations cause most outages
       – Deployment, reconfiguration, (rolling) upgrade, rollback…
         • as opposed to normal operations
       – DevOps-related: continuous integration/deploy/delivery
         • Etsy.com: 25 full deployments per day at 10 commits per deploy
       – Other drivers: resource sharing, micro services/partition migration, backup/recovery, auto-mitigation itself…
     • Limited control & visibility during sporadic operation
       – Heavy reliance on Cloud APIs
       – Limited visibility and exception handling capabilities
  4. Dependable Cloud Operation: Challenges
     • Our Context
       – Large-scale web/enterprise operation in Cloud
       – Distributed data analytics in Cloud (Hadoop/Spark)
     • Goal: detect, diagnose and react to errors occurring during a sporadic cloud operation
     • Challenges
       1. Anomaly detection during sporadic operations
       2. Undo/Recovery planning
       3. Modelling and analysis of sporadic operations
  5. Sporadic Operation Example: Rolling Upgrade
     [Process diagram; node labels as extracted: Update Auto-Scaling Group (ASG); Remove & Deregister Old Instances from ELB; Wait for ASG to Start New Instances; Terminate Old Instances; Register New Instances with ELB; Sort Instances; Stop; Start]
     - Have 100 servers in cloud with version 1 software
     - Upgrade 10 servers at a time to version 2 software
     - No downtime or redundancy cost
     - Can take a long time to complete, with errors during the operation and interference from other operations
  6. Challenge 1: Anomaly Detection
     • Traditional anomaly-based error detection is designed for “normal operation”
       – Significant false positives OR disable all monitoring during sporadic operation
     • Continuous changes to the production systems
       – From months at scheduled downtime to hours at all times
       – Multiple operations at the same time
     • Quality of automation scripts + human
       – Fully testing the operation (scripts + human) in an uncertain cloud environment is very difficult
  7. Our Approach: Use Process Context
     • Offline: treat an operation as a process
       – Process discovered automatically from logs/scripts
         • Clustering of log lines and process mining
       – Intermediary step outcomes specified as assertions
     • Online: use process context
       – Process context: process/instance/step ids, expected states…
       – Errors detected by examining logs and monitoring data
         • Assertion evaluation integrated with monitoring facilities
         • Compliance checking against expected processes using logs
       – Detected errors are further diagnosed for (root) causes
         • Examining a fault tree to locate potential root causes
         • Performing more diagnostic tests and on-demand assertions
     X. Xu, L. Zhu, et al. “POD-Diagnosis: Error Diagnosis of Sporadic Operations on Cloud Applications,” 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2014.
  8. Example: Rolling Upgrade Using Asgard
     [Diagram; labels as extracted: Operator; Asgard; Log data; Process Mining Service; Discovered Model; Create Snapshot; Check AZs; Create instance from snapshot; Create AMI from instance; Evaluate AMI; edge labels: Read by, Controls, Outputs, Generates; phases: Offline, Online]
  9. Process Mining Service: how it works
     • Process Mining: Discovery
       1. Collect the logs (using Logstash)
       2. Filter the logs
       3. Calculate the string distance (Levenshtein distance) between each pair of log lines
       4. Cluster the log lines
       5. Look at the dendrogram to decide on a threshold
       6. Name & combine clusters
       7. Derive regular expressions for the clusters
       8. Classify the log lines using the regular expressions and cluster names
       9. Import the altered log into process mining tools
       10. Apply different process discovery algorithms
       11. If anything requires changes, go back to the respective steps and redo from there
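Steps 3 to 5 of the discovery procedure can be illustrated with a minimal sketch. It is not the authors' tooling: the sample log lines, the length-normalized distance, the single-linkage grouping, and the 0.2 threshold are all assumptions chosen for illustration. Grouping every pair of lines whose distance falls under a threshold corresponds to cutting a single-linkage dendrogram at that height.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def cluster_lines(lines, threshold):
    """Single-linkage clustering: merge lines whose normalized distance <= threshold."""
    parent = list(range(len(lines)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(len(lines)), 2):
        longest = max(len(lines[i]), len(lines[j]), 1)
        if levenshtein(lines[i], lines[j]) / longest <= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for idx, line in enumerate(lines):
        clusters.setdefault(find(idx), []).append(line)
    return list(clusters.values())


# Hypothetical log lines, not taken from real Asgard output.
sample_log = [
    "Remove instance i-0a1 from ASG",
    "Remove instance i-0b2 from ASG",
    "Terminate instance i-0a1",
    "Terminate instance i-0b2",
]
clusters = cluster_lines(sample_log, threshold=0.2)
for group in clusters:
    print(group)
```

Each resulting cluster would then be named and turned into a regular expression (steps 6 and 7), which this sketch leaves out.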
  10. POD-Detection: Error Detection
      Error Detection Service has two methods for detecting errors:
      • Assertion Checking
      • Conformance Checking
  11. Assertion Checking: how it works
      Log line:
      Assertions:
  12. Assertion Checking: how it works
      Log line:
      • Remove ...
      Assertions:
      • i has been de-registered from ELB
      • i has been removed from ASG
      • there is 1 less instance of v1
  13. Assertion Checking: how it works
      Log line:
      • Remove ...
      • Terminate ...
      Assertions:
      • i successfully terminated
  14. Assertion Checking: how it works
      Log line:
      • Remove ...
      • Terminate ...
      • Wait ...
      Assertions:
      • Next log line should appear within 17m35s (95th percentile)
  15. Assertion Checking: how it works
      Log line:
      • Remove ...
      • Terminate ...
      • Wait ...
      • New instance ...
      Assertions:
      • i′ successfully launched
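The assertion-checking slides can be read as a lookup from a recognized log line to the checks that should hold afterwards, plus a timer derived from historical step durations. The sketch below is only an illustration under assumptions: the `state` dictionary stands in for queries to real cloud APIs, and the step names, instance ids, and duration history are hypothetical.

```python
import statistics

# Hypothetical snapshot of cloud state; a real checker would query cloud APIs.
state = {"elb": {"i-2"}, "asg": {"i-2"}, "terminated": set()}

# Step name -> list of (description, predicate over state and instance id).
ASSERTIONS = {
    "Remove": [
        ("de-registered from ELB", lambda s, i: i not in s["elb"]),
        ("removed from ASG", lambda s, i: i not in s["asg"]),
    ],
    "Terminate": [
        ("successfully terminated", lambda s, i: i in s["terminated"]),
    ],
}


def failed_assertions(step, instance, current_state):
    """Return the descriptions of assertions that do not hold after `step`."""
    checks = ASSERTIONS.get(step, [])
    return [desc for desc, holds in checks if not holds(current_state, instance)]


def percentile_95(durations):
    """95th percentile of historical durations (needs at least two samples)."""
    return statistics.quantiles(durations, n=100, method="inclusive")[94]


def timed_out(elapsed_seconds, history):
    """True if the next log line is later than the 95th percentile of history."""
    return elapsed_seconds > percentile_95(history)


# Hypothetical history of "Wait" step durations, in seconds.
wait_history = [600, 640, 700, 720, 800, 850, 900, 950, 1000, 1055]

print(failed_assertions("Remove", "i-1", state))      # all hold for i-1
print(failed_assertions("Terminate", "i-1", state))   # i-1 not yet terminated
print(timed_out(1200, wait_history))
```

A failed assertion or an expired timer would be reported as a detected error and handed to diagnosis.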
  16. Conformance Checking: how it works
      Log lines:
  17. Conformance Checking: how it works
      Log lines:
      • Remove ...
  18. Conformance Checking: how it works
      Log lines:
      • Remove ...
      • Terminate ...
  19. Conformance Checking: how it works
      Log lines:
      • Remove ...
      • Terminate ...
      • Wait ...
  20. Conformance Checking: how it works
      Log lines:
      • Remove ...
      • Terminate ...
      • Wait ...
      • Terminate ...???
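The unexpected second "Terminate" in the conformance-checking example can be caught by replaying the classified log lines against the expected process. The sketch below simplifies heavily: it assumes the discovered model can be reduced to a table of allowed successor steps, whereas process mining tools typically replay logs on richer models such as Petri nets. The step names follow the slides; the transition table itself is a hypothetical reconstruction.

```python
# Hypothetical "allowed next step" table for one rolling-upgrade loop.
ALLOWED_NEXT = {
    "START": {"Remove"},
    "Remove": {"Terminate"},
    "Terminate": {"Wait"},
    "Wait": {"New instance"},
    "New instance": {"Remove", "END"},
}


def first_deviation(events):
    """Replay events against ALLOWED_NEXT.

    Returns (position, previous_step, offending_step) for the first event
    that the model does not allow, or None if the trace conforms so far.
    """
    current = "START"
    for position, event in enumerate(events):
        if event not in ALLOWED_NEXT.get(current, set()):
            return position, current, event
        current = event
    return None


trace = ["Remove", "Terminate", "Wait", "Terminate"]
print(first_deviation(trace))
```

Because the check only needs the current step and the model, it can run online as each log line arrives, which is what allows unknown errors such as skipped or repeated activities to be flagged.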
  21. POD-Diagnosis: how it works
      • Fault trees are built as a knowledge base
      • Process context is used for fault tree pruning
      • On-demand diagnosis tests locate the (root) causes
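One way to picture fault tree pruning by process context is a tree whose nodes are tagged with the process steps in which they can occur. The sketch below is an assumption-laden illustration, not the POD-Diagnosis implementation: the `FaultNode` structure, the fault names, and the stubbed diagnostic tests are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set


@dataclass
class FaultNode:
    name: str
    steps: Set[str]                             # steps in which this fault can occur
    test: Optional[Callable[[], bool]] = None   # on-demand test; True means confirmed
    children: List["FaultNode"] = field(default_factory=list)


def diagnose(node: FaultNode, step: str) -> List[str]:
    """Return confirmed leaf causes, skipping subtrees irrelevant to `step`."""
    if step not in node.steps:
        return []  # pruned by process context: no tests are run below this node
    if not node.children:
        confirmed = node.test is not None and node.test()
        return [node.name] if confirmed else []
    causes: List[str] = []
    for child in node.children:
        causes.extend(diagnose(child, step))
    return causes


tests_run: List[str] = []


def stub_test(name: str, result: bool) -> Callable[[], bool]:
    """Build a fake diagnostic test that records that it ran."""
    def run() -> bool:
        tests_run.append(name)
        return result
    return run


# Hypothetical fault tree for "new instance did not come up".
tree = FaultNode(
    name="new instance not in service",
    steps={"Wait", "Register"},
    children=[
        FaultNode("wrong launch configuration", {"Wait"}, stub_test("launch-config", True)),
        FaultNode("key pair missing", {"Wait"}, stub_test("key-pair", False)),
        FaultNode("ELB registration failed", {"Register"}, stub_test("elb", True)),
    ],
)

causes = diagnose(tree, step="Wait")
print(causes, tests_run)
```

With the step context set to "Wait", the ELB branch is never tested, which is the point of pruning: fewer diagnostic tests and a faster answer.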
  22. Evaluation: POD-Detection/Diagnosis
      • Experiments
        – Rolling upgrade of a 100+ node cluster in AWS
          • Fault injection + confounding processes: random kill, scaling-in…
      • Detected errors
        – Assertion checking: known errors and global errors
          • Examples: key management, launch configuration, images…
        – Compliance checking: unknown errors
          • Skipping activities or undone activities
      • Time and precision
        – Compared with Asgard/Monitoring internal mechanisms
          • Detected more errors earlier
        – Diagnosis: limited to known causes in the fault tree
          • 95th percentile less than 4 s; accuracy ranges from 80% to 100%
  23. Evaluation: POD-Detection/Diagnosis
      [Figure-only slide]
  24. Other Related Research Challenges
      1. Anomaly detection during sporadic operations
      2. Undo/Recovery planning
      3. Modelling and analysis of sporadic operation
  25. Challenge 2: Undo/Recovery Planning
      [Diagram; labels as extracted: S-i, …, S0, S1, S2, Serr; A certain step; Reparation; Compensation; Undo; Parameterizable Redo; Alternative; Checkpoint-based Undo; Previous states]
  26. Undo/Undoability Approach in a Nutshell
      • Goal: undo support for “indirect control” setting
        – Problem 1: some actions are irreversible, e.g., delete
        – Problem 2: undo ≠ copy back previous state of memory
          • Have to call the right actions on the right resources in the right order
        – Problem 3: partly irreversible operations, e.g. on Amazon WS:
          • Stopping a machine disassociates an elastic IP address (if any), and releases internal IP / public DNS
          • Starting the machine isn't undo: elastic IP is dangling, internal IP / public DNS / timestamps are different
      • Solution components:
        – Replace “do” with “pseudo-do”
        – Undo System based on AI Planning
          • Outcome: sequence of undo actions
        – Undoability Checking:
          • Is the operation I'm about to execute undoable?
          • Learn which aspects can be fully undone for each operation (whole domain)
          • If not, can we abstract / change so that undoability is given?
        – Projection (of a domain)
      Ingo Weber et al. “Supporting undoability in systems operations.” In USENIX LISA'13: Large Installation System Administration Conference, Washington, DC, USA, November 2013.
  27. Undoability Checking Approach
      [Diagram; labels as extracted: Tool user (e.g., sys admin); Tool provider; Operation(s) to execute (e.g., script, command); Resources and properties required to be undoable; Full domain model (e.g., AWS); Projection Specification; Undoability Checker; Apply Projection; Projected domain model; Per operation: Generate pre and post-states; Check undoability per pre-post state pair; For each pair: call AI Planner; Result: Undoability (yes/no), List of causes if not undoable; edge labels: Define, Generate, Feedback]
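The core check, whether some sequence of actions leads from the post-state back to the pre-state, can be sketched with a breadth-first search standing in for the AI planner. Everything here is an assumption made for illustration: the three-action toy domain, the fact names, and the use of plain BFS rather than a real planner. The toy domain mirrors the elastic IP example from the previous slide, and the `relevant` argument plays the role of a projection that drops properties the user does not need restored.

```python
from collections import deque

# Hypothetical toy domain. Each action is (preconditions, facts added, facts deleted).
ACTIONS = {
    "stop": (frozenset({"running"}),
             frozenset({"stopped"}),
             frozenset({"running", "eip_attached", "internal_ip_A", "internal_ip_B"})),
    "start": (frozenset({"stopped"}),
              frozenset({"running", "internal_ip_B"}),  # a new internal IP is assigned
              frozenset({"stopped"})),
    "attach_eip": (frozenset({"running"}),
                   frozenset({"eip_attached"}),
                   frozenset()),
}


def apply_action(state, action):
    """Return the successor state, or None if preconditions are not met."""
    pre, add, delete = action
    if not pre <= state:
        return None
    return (state - delete) | add


def find_undo_plan(post_state, pre_state, actions, relevant=None, max_depth=6):
    """BFS for actions leading from post_state to pre_state.

    If `relevant` is given, states are compared only on those facts.
    Returns a list of action names, or None if no plan is found.
    """
    def project(s):
        return s if relevant is None else s & relevant

    goal = project(pre_state)
    queue = deque([(post_state, [])])
    seen = {post_state}
    while queue:
        state, plan = queue.popleft()
        if project(state) == goal:
            return plan
        if len(plan) >= max_depth:
            continue
        for name, action in actions.items():
            nxt = apply_action(state, action)
            if nxt is not None and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, plan + [name]))
    return None


pre = frozenset({"running", "eip_attached", "internal_ip_A"})
post = apply_action(pre, ACTIONS["stop"])

full_plan = find_undo_plan(post, pre, ACTIONS)
projected_plan = find_undo_plan(
    post, pre, ACTIONS, relevant=frozenset({"running", "stopped", "eip_attached"}))
print(full_plan, projected_plan)
```

In the full domain no plan exists because the original internal IP cannot be regained; once the internal IP is projected away, the search finds an undo sequence.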
  28. Challenge 3: Modeling and Analysis
      • Approach: Model as stochastic processes
        – Discrete/Continuous-Time Markov Chain (DTMC/CTMC)
          • Forward states: net successful operations
          • Backward states: failure or deliberate rollback/undo
      • A family of g-k chains with different parameters
        – g: rolling-upgrade wave granularity; k: no. of failures/rollbacks per wave
      Daniel Sun, L. Zhu, et al. “Understanding Rolling Upgrade,” 33rd International Symposium on Reliable Distributed Systems (SRDS), 2014 (submitted).
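As a rough illustration of the forward/backward idea, consider a much simpler chain than the g-k family: each step moves one wave forward with probability p, and otherwise rolls one wave back (staying put at wave 0). This simplified DTMC, the function name, and the parameter values are assumptions, not the model from the cited paper. If d_i is the expected number of steps to get from wave i to wave i+1, then d_0 = 1/p and d_i = (1 + (1-p) d_{i-1}) / p, and the expected completion time is the sum of the d_i.

```python
def expected_steps(waves: int, p_success: float) -> float:
    """Expected number of steps to finish `waves` waves in the simplified chain.

    Each step succeeds (one wave forward) with probability p_success and
    otherwise rolls back one wave; at wave 0 a failure leaves the state unchanged.
    """
    if waves < 0:
        raise ValueError("waves must be non-negative")
    if not 0 < p_success <= 1:
        raise ValueError("p_success must be in (0, 1]")
    q = 1.0 - p_success
    total = 0.0
    d_prev = 0.0  # makes the first iteration yield d_0 = 1 / p_success
    for _ in range(waves):
        d = (1.0 + q * d_prev) / p_success
        total += d
        d_prev = d
    return total


# 100 servers upgraded 10 at a time gives 10 waves.
for p in (1.0, 0.95, 0.8):
    print(p, expected_steps(10, p))
```

Even this toy version shows how quickly the expected completion time grows once failures trigger rollbacks rather than simple retries.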
  29. Model used for
      Predictions
        - e.g. completion time, failure rate impact
      Optimization and Decision Problems
        - e.g. when to activate new versions to guarantee a 99.99% success
  30. Connection with AMPLab BDAS
  31. Projects Related to BDAS (1/2)
      1. Log/Metrics analysis in POD-Diagnosis
         – Currently using Spark/MLBase
         – Voluminous log/events into Spark Streaming
      2. Dependable deployment/operation of BDAS
         – POD applied to Hadoop before, maybe BDAS?
      3. Multi-level granularity access for data analytics
         – Australian Urban Research Infrastructure Network (AURIN)
           • Portal to provide transport-related data to international researchers
           • Cluster sharing for in-portal pre-processing and analytics
           • De-anonymization concerns and different views for the same data
         – Evaluating how BDAS can support this
  32. Projects Related to BDAS (2/2)
      Redacted
      4. Data scientist workflow and local exploration
      5. Distributed machine learning
  33. Team Acknowledgement
      • Researchers
        – Len Bass
        – Alan Fekete
        – Anna Liu
        – Daniel Sun
        – Hiroshi Wada
        – Ingo Weber
        – Sherry Xu
        – Liming Zhu
      • Engineers
        – Adnene Guabtni
        – Chao Li
      • Students
        – Amer Abdalamer
        – Ahmed Alqahtani
        – Mostafa Farshchi
        – Min Fu
        – Jin Li
        – Matthew Sladescu
        – Donna Xu
        – DongYao Wu
