NICTA Copyright 2012 From imagination to impact
Dependable Operation
Dr. Liming Zhu
Software Systems Research Group
NICTA (National ICT Australia) &
University of New South Wales
DevOps Days Downunder, 2013
Liming.Zhu@nicta.com.au slideshare.net/LimingZhu/
Motivation
• Applications fail due to operation issues
– Gartner report: 80% of outages are caused by people/process issues
• Sporadic activities: replication/failover, auto-scaling, upgrade…
– The point is not that dependability issues trigger mitigating operations,
but the converse:
• dependability is affected, often unexpectedly, by these mitigating
activities and other sporadic activities
– Lessons from our own cloud disaster recovery product:
Yuruware.com
• Complex interleaving “sporadic” processes/activities
– Scripts, tools, humans
– Activities auto-triggered by policies, monitoring and analysis
– Logs/Events often lack the “process-context”
Our Process-Oriented Approach
• Existing artifact-oriented and state-based research
– Log analysis linking back to issues in source code
– Static configuration analysis and constraint checking
– State-based system-level models
• We treat an operation as a set of steps
– Executed by fault-prone agents (scripts/tools/human)
– Requiring various fault-prone resources (computing/nodes/environment)
– Faults at one step may surface later at another step
– Exception handling: error diagnosis, undo/redo, fixing, tolerating…
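A minimal sketch of this process view, under stated assumptions: the `Step` record, the `run_operation` helper, and the toy launch/register steps are all hypothetical and not taken from the talk. Each step carries a post-condition check and an undo action; when a step fails, the steps already completed are undone in reverse order.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One step of an operation (hypothetical structure, not from the talk)."""
    name: str
    do: Callable[[dict], None]             # action carried out by a fault-prone agent
    postcondition: Callable[[dict], bool]  # did the step really take effect?
    undo: Callable[[dict], None]           # compensating action for this step


def run_operation(steps: List[Step], state: dict) -> bool:
    """Run steps in order; on a failed step, undo completed steps in reverse.

    Partial effects of the failing step itself are not undone in this sketch.
    """
    done: List[Step] = []
    for step in steps:
        try:
            step.do(state)
            if not step.postcondition(state):
                raise RuntimeError(f"post-condition failed after {step.name}")
            done.append(step)
        except Exception as err:
            print(f"{step.name}: {err}; undoing {len(done)} completed step(s)")
            for finished in reversed(done):
                finished.undo(state)
            return False
    return True


# Toy steps standing in for real cloud calls (assumed for illustration only).
def launch(state: dict) -> None:
    state["instances"] = state.get("instances", 0) + 1


def terminate(state: dict) -> None:
    state["instances"] -= 1


def register(state: dict) -> None:
    pass  # simulated silent fault: the instance never gets registered


def deregister(state: dict) -> None:
    state.pop("registered", None)


steps = [
    Step("launch", launch, lambda s: s["instances"] == 1, terminate),
    Step("register", register, lambda s: s.get("registered", False), deregister),
]
state: dict = {}
ok = run_operation(steps, state)
print(ok, state)  # False {'instances': 0}
```

The silent fault in `register` raises no exception of its own. It is caught only because the post-condition is checked immediately after the step, rather than surfacing later at another step.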
What We Are Working On
• Undo Framework and Undo-ability of Operations
– AWS Cloud API wrapper to allow undo
– Use AI Planning to check undo-ability and plan undo path
• Model, Monitor and Simulate Operations
– Post-condition verification and monitoring of steps
– Use monitored process context for error diagnosis and recovery
– Simulate large-scale operations: probability/time of successful
completion, bottlenecks and problems
• Process Mining from Logs
– Mine a process from existing log files
– Detect deviation early or help error diagnosis
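As an illustration of the process-mining idea, here is a minimal sketch that assumes logs have already been parsed into per-run lists of activity names. The talk does not say which mining algorithm is used; this sketch substitutes a simple directly-follows relation, and the function names and activity names are hypothetical. A new run is flagged at the first transition never seen in the historical logs.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Set, Tuple

START = "<start>"  # sentinel marking the beginning of a trace


def mine_directly_follows(traces: Iterable[List[str]]) -> Dict[str, Set[str]]:
    """Record, for each activity, which activities were seen directly after it."""
    follows: Dict[str, Set[str]] = defaultdict(set)
    for trace in traces:
        prev = START
        for activity in trace:
            follows[prev].add(activity)
            prev = activity
    return follows


def first_deviation(
    follows: Dict[str, Set[str]], trace: List[str]
) -> Optional[Tuple[int, str]]:
    """Return (index, activity) of the first unseen transition, or None."""
    prev = START
    for i, activity in enumerate(trace):
        if activity not in follows.get(prev, set()):
            return i, activity
        prev = activity
    return None


# Hypothetical traces from earlier, successful upgrade runs.
historical = [
    ["detach", "upgrade", "health_check", "attach"],
    ["detach", "upgrade", "retry_upgrade", "health_check", "attach"],
]
model = mine_directly_follows(historical)

# A run that skips the health check deviates at its third event.
print(first_deviation(model, ["detach", "upgrade", "attach"]))  # (2, 'attach')
print(first_deviation(model, historical[0]))                    # None
```

Because the check runs event by event, it can be applied to a log as it is being written, which is what allows a deviation to be reported early rather than after the operation has finished.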
Tell us the right problems and approaches!
Liming.Zhu@nicta.com.au slideshare.net/LimingZhu/