NICTA Copyright 2012 From imagination to impact
Dependable Operation
Performance Management and
Capacity Planning Under
Continuous Changes
April, 2014
Dr. Liming Zhu, Dr. Ingo Weber
NICTA/UNSW
http://slideshare.net/limingzhu
NICTA (National ICT Australia)
• Australia's National Centre of Excellence in
Information and Communication Technology
• Five Research Labs:
– ATP: Australian Technology Park, Sydney
– NRL: UNSW, Sydney
– CRL: ANU, Canberra
– VRL: Uni. Melbourne
– QRL: Uni. Queensland and QUT
• 700 staff including 270 PhD students
• Budget: ~$90M/yr from Fed/State Gov and
industry
• ~600 research papers/year, ~150 patents total
NICTA: Research and Outcomes
[Diagram labels: Networks; Optimisation; Machine Learning; Computer Vision; Broadband and the Digital Economy; Infrastructure, Transport and Logistics; Security and Environment; Engineering and Technology Development; University Partners; Industry and Government Partners; Research Excellence; Wealth Creation]
Software Systems Research Group (SSRG)
• Vision: Cost Effective Dependable Systems
• Two Major Activities
– Trustworthy Systems – single systems
– Dependable Cloud Computing – distributed systems
• Research history related to capacity planning
– Revel8or/MDABench: capacity planning prototype
– Spin-out: http://www.performance-assurance.com.au/
– SPEC (spec.org) research group member
• Cloud (elasticity) benchmarking
– Keynote at ICPE 2013: “Supporting Operations Personnel
Through Performance Engineering” by Len Bass
New Challenge: Continuous Changes
• Significantly shorter release cycles
– Continuous delivery/deployment: from releases every few months during
scheduled downtime to releases within hours, at any time
• Etsy.com: 25 full deployments per day at 10 commits per deploy
• Resource sharing
– Multiple sporadic operations at all times
– scaling in/out, snapshot, migration, reconfiguration,
rolling upgrade, cron-jobs, backup, recovery…
• Cloud uncertainty
– Limited visibility and indirect control
Demands continuous capacity planning and
performance management
Sporadic Operation Example: Rolling Upgrade
[Flowchart: Start → Update Auto-Scaling Group (ASG) → Sort Instances → Remove & Deregister Old Instances from ELB → Terminate Old Instances → Wait for ASG to Start New Instances → Register New Instances with ELB → Stop]
- Start with 100 servers in the cloud running version 1 of the software
- Upgrade 10 servers at a time to version 2
- No downtime and no redundancy cost
- Can take a long time to complete, and errors can occur during the operation, including errors caused by other interfering operations
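The batch arithmetic of the rolling upgrade above can be sketched as follows. This is a simplified illustration, not Asgard's implementation: the `UpgradeStep` type and `plan_rolling_upgrade` function are hypothetical names, and the model assumes every server in a batch is out of service while it is being replaced.

```python
from dataclasses import dataclass


@dataclass
class UpgradeStep:
    batch: int           # 1-based batch number
    v1_in_service: int   # version 1 servers still serving traffic
    v2_in_service: int   # version 2 servers already serving traffic
    out_of_service: int  # servers being replaced in this batch


def plan_rolling_upgrade(total: int = 100, batch_size: int = 10) -> list[UpgradeStep]:
    """Return the fleet state while each batch is being replaced (hypothetical model)."""
    if total <= 0 or batch_size <= 0:
        raise ValueError("total and batch_size must be positive")
    steps = []
    upgraded = 0
    batch = 0
    while upgraded < total:
        batch += 1
        size = min(batch_size, total - upgraded)
        steps.append(UpgradeStep(batch, total - upgraded - size, upgraded, size))
        upgraded += size
    return steps


plan = plan_rolling_upgrade()
min_capacity = min(s.v1_in_service + s.v2_in_service for s in plan)
print(len(plan), "batches; minimum in-service servers:", min_capacity)
```

Under these assumptions the fleet never goes fully offline and never needs extra servers, but in-service capacity dips by one batch size during each step, which is one reason the operation interacts with capacity planning.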
System Monitoring During Rolling Upgrade
Our Approach
• Incorporating change-related knowledge into
system management
– Sporadic operation knowledge
• Process-Oriented Dependability (POD): error detection
and diagnosis under continuous change
• Alerting management using process context
• Availability analysis for sporadic operations
– External event knowledge
• Event-aware workload prediction
Process-Oriented Dependability (POD)
• Context
– Large-scale web/enterprise operation in Cloud
– Distributed data analytics in Cloud (Hadoop/Spark)
• Goal: detect, diagnose and react to errors
occurring during sporadic cloud operations
– Scope: “sporadic operations” (not normal operation)
• deployment, reconfiguration, (rolling) upgrade, rollback
• DevOps related: continuous integration/deploy/delivery
Operation as Process
• Offline: treat an operation as a process
– Process discovered automatically from logs/scripts
• Clustering of log lines and process mining
– Expected step outcomes specified as assertions
• Online: use process context
– Process context: process/instance/step ids, expected states
– Errors are detected by examining logs and monitoring data
• Assertion evaluation using monitoring facilities or directly
• Conformance checking against expected processes using logs
– Detected errors are further diagnosed for (root) causes
• Examining a fault tree to locate potential root causes
• Performing more diagnostic tests and on-demand assertions
X. Xu, L. Zhu, et al., "POD-Diagnosis: Error Diagnosis of Sporadic Operations on Cloud Applications," 44th Annual
IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2014.
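A minimal sketch of the offline step (turning log lines into a process model) is shown below. It is not the POD implementation: regular-expression tagging stands in for log-line clustering, a directly-follows relation stands in for full process mining, and the patterns, activity names, and sample log lines are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical patterns mapping raw log lines to activity names.
ACTIVITY_PATTERNS = [
    (re.compile(r"^Remove"), "remove_from_elb"),
    (re.compile(r"^Terminate"), "terminate_old"),
    (re.compile(r"^Wait"), "wait_for_asg"),
    (re.compile(r"^New instance"), "new_instance_ready"),
]


def to_activity(line):
    """Return the activity name for a log line, or None if it matches no pattern."""
    for pattern, name in ACTIVITY_PATTERNS:
        if pattern.search(line):
            return name
    return None


def discover_directly_follows(traces):
    """Build a map: activity -> set of activities observed directly after it."""
    follows = defaultdict(set)
    for trace in traces:
        activities = [a for a in map(to_activity, trace) if a is not None]
        for current, nxt in zip(activities, activities[1:]):
            follows[current].add(nxt)
    return dict(follows)


# One made-up trace covering two upgrade iterations.
traces = [[
    "Remove instance i-1 from ELB",
    "Terminate instance i-1",
    "Wait for ASG to start a new instance",
    "New instance i-9 ready",
    "Remove instance i-2 from ELB",
    "Terminate instance i-2",
    "Wait for ASG to start a new instance",
    "New instance i-10 ready",
]]
model = discover_directly_follows(traces)
```

The resulting map is the kind of "expected process" that online checks can compare incoming log lines against.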
Example: Rolling Upgrade Using Asgard
[Diagram labels: Operator; Asgard; Log data; Process Mining; Discovered Model; Service; Offline; Online. Edge labels: Read by; Controls; Generates; Outputs. Model steps shown: Check AZs; Create Snapshot; Create instance from snapshot; Create AMI from instance; Evaluate AMI]
POD-Detection: Error Detection
Error Detection Service has two
methods for detecting errors:
• Assertion Checking
• Conformance Checking
Assertion Checking: how it works
Log line: Remove ...
Assertions:
• i has been de-registered from ELB
• i has been removed from ASG
• there is 1 less instance of v1
Log line: Terminate ...
Assertions:
• i successfully terminated
Log line: Wait ...
Assertions:
• Next log line should appear within 17m35s (95th percentile)
Log line: New instance ...
Assertions:
• i' successfully launched
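The log-line-to-assertion mapping above can be sketched as a lookup table of predicates. This is an illustration under stated assumptions: `CloudState` is a hypothetical in-memory stand-in for what would really be queried from the cloud API or monitoring facilities, and the "1 less instance of v1" assertion is omitted for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class CloudState:
    """Hypothetical snapshot of cloud state, keyed by instance id."""
    elb_registered: set = field(default_factory=set)
    asg_members: set = field(default_factory=set)
    terminated: set = field(default_factory=set)
    running: set = field(default_factory=set)


# Step name -> list of (description, predicate(state, instance_id)).
ASSERTIONS = {
    "Remove": [
        ("de-registered from ELB", lambda s, i: i not in s.elb_registered),
        ("removed from ASG", lambda s, i: i not in s.asg_members),
    ],
    "Terminate": [("successfully terminated", lambda s, i: i in s.terminated)],
    "New instance": [("successfully launched", lambda s, i: i in s.running)],
}

# Timing assertion for the Wait step: 17m35s, the 95th percentile on the slide.
MAX_WAIT_SECONDS = 17 * 60 + 35


def check_step(step, instance_id, state):
    """Return the descriptions of all assertions that fail for this log step."""
    return [desc for desc, pred in ASSERTIONS.get(step, [])
            if not pred(state, instance_id)]


def wait_timed_out(seconds_since_wait):
    """True if the next log line is overdue after a Wait step."""
    return seconds_since_wait > MAX_WAIT_SECONDS


# Example: i-1 left the ELB but is still listed in the ASG.
state = CloudState(elb_registered={"i-2"}, asg_members={"i-1", "i-2"})
failed = check_step("Remove", "i-1", state)
```

An empty result means the step's expected outcome holds; any returned description is a detected error that can be passed on to diagnosis.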
Conformance Checking: how it works
Log lines:
• Remove ...
• Terminate ...
• Wait ...
• Terminate ...???
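The unexpected second "Terminate" above can be caught by replaying log steps against the expected process. The sketch below uses a plain transition table rather than a full process-model replay; the `ALLOWED_NEXT` table and the `START`/`END` markers are assumptions made for illustration.

```python
# Hypothetical expected process: which step may follow which.
ALLOWED_NEXT = {
    "START": {"Remove"},
    "Remove": {"Terminate"},
    "Terminate": {"Wait"},
    "Wait": {"New instance"},
    "New instance": {"Remove", "END"},
}


def first_deviation(steps):
    """Return (index, step) of the first non-conforming step, or None if all conform."""
    previous = "START"
    for index, step in enumerate(steps):
        if step not in ALLOWED_NEXT.get(previous, set()):
            return index, step
        previous = step
    return None


observed = ["Remove", "Terminate", "Wait", "Terminate"]
deviation = first_deviation(observed)
print(deviation)
```

Because the check needs only the model and the log stream, it can flag errors nobody wrote an assertion for, such as skipped or repeated activities.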
POD-Diagnosis: how it works
• Fault trees are built as a knowledge base
• On-demand diagnosis tests locate the (root) causes
• Process context is used for fault tree (FT) pruning
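Fault tree pruning with process context can be sketched as a tree walk that skips branches irrelevant to the current step and runs diagnostic tests only on what remains. The `FaultNode` structure, the fault names, and the `probe` helper are hypothetical and are not taken from the POD-Diagnosis knowledge base.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class FaultNode:
    name: str
    steps: frozenset = frozenset()  # steps where this fault can occur; empty means any
    test: Optional[Callable[[], bool]] = None  # on-demand test; True means confirmed
    children: list = field(default_factory=list)


def diagnose(node, current_step):
    """Return names of confirmed leaf causes that are relevant to current_step."""
    if node.steps and current_step not in node.steps:
        return []  # pruned by process context: no test is run for this branch
    if not node.children:
        return [node.name] if node.test is not None and node.test() else []
    causes = []
    for child in node.children:
        causes.extend(diagnose(child, current_step))
    return causes


calls = []  # records which diagnostic tests actually ran


def probe(name, result):
    """Build a fake diagnostic test that logs its invocation."""
    def run():
        calls.append(name)
        return result
    return run


tree = FaultNode("new instance not launched", children=[
    FaultNode("machine image unavailable", frozenset({"Wait"}), probe("image", True)),
    FaultNode("key pair missing", frozenset({"Wait"}), probe("key", False)),
    FaultNode("ELB de-registration failed", frozenset({"Remove"}), probe("elb", True)),
])
causes = diagnose(tree, "Wait")
```

In this example the ELB test never runs because the error was detected at the Wait step, which illustrates how pruning can reduce diagnosis time.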
Evaluation: POD-Detection/Diagnosis
• Experiments
– Rolling upgrade of 100+ node cluster in AWS
• Fault injection + confounding processes: random kill, scaling-in, ...
• Detected errors
– Assertion checking: known errors and global errors
• Examples: key management, launch configuration, images
– Conformance checking: unknown errors
• skipped or undone activities
• Timing and precision
– Compared with Asgard/monitoring internal mechanisms
• Detected more errors earlier
– Diagnosis: limited to known causes in FT
• 95th percentile under 4 s; accuracy ranges from 80% to 100%
Our Approach
• Incorporating change-related knowledge into
system management
– sporadic operation knowledge
• Process-Oriented Dependability: Error detection and
diagnosis under continuous change
• Alerting management using process context
• Availability analysis for sporadic operations
– External event knowledge
• Event-aware workload prediction
Alerting Management using Process Context
• Do not turn off alerts during sporadic operation
• Dynamically suppressing and annotating alerts
using sporadic operation knowledge
– CPU sensitive?
– Network sensitive?
– I/O sensitive?
– Health checking sensitive?
• Benefits
– Reduce false positives of alerts
– Add context to system monitoring data for later
capacity planning and performance tuning
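One way to picture suppression and annotation is a sensitivity profile per sporadic operation, consulted whenever an alert fires. The profile contents, operation names, and `Alert` type below are hypothetical; the slides do not specify how sensitivities are represented.

```python
from dataclasses import dataclass

# Hypothetical profile: metric classes each operation is expected to disturb.
OPERATION_SENSITIVITY = {
    "rolling_upgrade": {"cpu", "health_check"},
    "backup": {"io", "network"},
}


@dataclass
class Alert:
    metric: str
    message: str
    suppressed: bool = False
    annotation: str = ""


def apply_process_context(alert, running_operations):
    """Annotate the alert with running operations; suppress it if one explains it."""
    causes = sorted(op for op in running_operations
                    if alert.metric in OPERATION_SENSITIVITY.get(op, set()))
    if running_operations:
        alert.annotation = "during: " + ", ".join(sorted(running_operations))
    if causes:
        alert.suppressed = True
        alert.annotation += "; expected effect of: " + ", ".join(causes)
    return alert


cpu_alert = apply_process_context(Alert("cpu", "CPU > 90%"), {"rolling_upgrade"})
io_alert = apply_process_context(Alert("io", "disk queue high"), {"rolling_upgrade"})
```

Alerts stay switched on throughout: the I/O alert still fires during the upgrade but carries the operation context, while the CPU alert is suppressed as an expected side effect.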
Availability Analysis for Sporadic Operation
• Sporadic Operation's Impact on Availability
– Using Stochastic Reward Network (SRN)
– Maintenance/Backup/Recovery operation
• Architecture has effect as well
Qinghua Lu, et al., "Incorporating uncertainty into in-cloud application deployment decisions for
availability," IEEE 6th International Conference on Cloud Computing, June 2013.
Availability Estimation for Different
Deployment and Recovery Approaches
Event-Aware Workload Prediction
[Diagram labels: Upcoming Event Repository; Predict Workload; Workload Prediction; Event Workload Model]
Matthew Sladescu, et al., "Event aware workload prediction: A study using auction
events," International Conference on Web Information Systems Engineering (WISE), 2012.
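Predicting Workload
[Figure: per-event workload curves are added together (+, +, =) to give the predicted workload; axes: Time (min) vs. Bids/min; annotation: Time to Predict]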
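The summation shown in the figure can be sketched as adding one workload curve per upcoming event, each shifted to the event's start time. This is only an illustration of the additive idea: the single shared `event_model` curve and all numbers are made up, and the model in the cited study may differ.

```python
def predict_workload(upcoming_events, event_model, horizon_min):
    """Sum per-event workload curves into one bids/min forecast.

    upcoming_events: start minute of each known upcoming event
    event_model: bids/min contributed at each minute after an event starts
    horizon_min: number of minutes to forecast
    """
    forecast = [0.0] * horizon_min
    for start_min in upcoming_events:
        for offset, rate in enumerate(event_model):
            minute = start_min + offset
            if 0 <= minute < horizon_min:
                forecast[minute] += rate
    return forecast


# Hypothetical auction curve: bidding rises toward the close of the auction.
event_model = [1.0, 2.0, 5.0]
forecast = predict_workload(upcoming_events=[0, 1], event_model=event_model, horizon_min=5)
print(forecast)
```

Overlapping events stack, so the forecast peak can exceed the peak of any single event, which is the case capacity planning needs to anticipate.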
Summary
• Systems are undergoing continuous change
– Continuous deployment + Cloud uncertainty/visibility
• Use change-related knowledge in system management
– sporadic operation knowledge
• POD: Error detection and diagnosis under continuous change
• Alerting management using process context
• Availability analysis for sporadic operations
– External event knowledge
• Event-aware workload prediction
• We need industry help and collaboration
– Logs, trials, case studies and feedback
Book: http://www.ssrg.nicta.com.au/projects/devops_book/
Contact: {firstname.lastname@nicta.com.au}

Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 

Dependable Operation - Performance Management and Capacity Planning Under Continuous Changes

  • 1. NICTA Copyright 2012 From imagination to impact Dependable Operation Performance Management and Capacity Planning Under Continuous Changes April, 2014 Dr. Liming Zhu, Dr. Ingo Weber NICTA/UNSW http://slideshare.net/limingzhu
  • 2. NICTA (National ICT Australia) • Australia's National Centre of Excellence in Information and Communication Technology • Five Research Labs: – ATP: Australian Technology Park, Sydney – NRL: UNSW, Sydney – CRL: ANU, Canberra – VRL: Uni. Melbourne – QRL: Uni. Queensland and QUT • 700 staff including 270 PhD students • Budget: ~$90M/yr from Fed/State Gov and industry • ~600 research papers/year, ~150 patents total
  • 3. NICTA: Research and Outcomes [Diagram labels: Networks; Optimisation; Machine Learning; Computer Vision; Broadband and the Digital Economy; Infrastructure Transport and Logistics; Security and Environment; University Partners; Industry and Government Partners; Research Excellence; Wealth Creation; Engineering and Technology Development]
  • 4. Software Systems Research Group (SSRG) • Vision: Cost Effective Dependable Systems • Two Major Activities – Trustworthy Systems – single systems – Dependable Cloud Computing – distributed systems • Research history related to capacity planning – Reve8tor/MDABench: capacity planning prototype – Spin-out: http://www.performance-assurance.com.au/ – SPEC (spec.org) research group member • Cloud (elasticity) benchmarking – Keynote at ICPE 2013: “Supporting Operations Personnel Through Performance Engineering” by Len Bass
  • 5. New Challenge: Continuous Changes • Significantly shorter release cycles – Continuous delivery/deployment: from months at scheduled downtime to hours at all times • Etsy.com: 25 full deployments per day at 10 commits per deploy • Resource sharing – Multiple sporadic operations at all times – scaling in/out, snapshot, migration, reconfiguration, rolling upgrade, cron-jobs, backup, recovery… • Cloud uncertainty – Limited visibility and indirect control • Result: this demands continuous capacity planning and performance management
  • 6. Sporadic Operation Example: Rolling Upgrade [Flow diagram, step labels as extracted: Start; Update Auto-Scaling Group (ASG); Sort Instances; Remove & Deregister Old Instances from ELB; Terminate Old Instances; Wait for ASG to Start New Instances; Register New Instances with ELB; Stop] – Have 100 servers in the cloud running version 1 software – Upgrade 10 servers at a time to version 2 software – No downtime or redundancy cost – Can take a long time to complete, with errors during the operation and interference from other operations
  • 7. System Monitoring During Rolling Upgrade
  • 8. Our Approach • Incorporating change-related knowledge into system management – Sporadic operation knowledge • Process-Oriented Dependability (POD): error detection and diagnosis under continuous change • Alerting management using process context • Availability analysis for sporadic operations – External event knowledge • Event-aware workload prediction
  • 9. Process-Oriented Dependability (POD) • Context – Large-scale web/enterprise operation in Cloud – Distributed data analytics in Cloud (Hadoop/Spark) • Goal: detect, diagnose and react to errors occurring during sporadic cloud operations – Scope: “sporadic operations” (not normal operation) • deployment, reconfiguration, (rolling) upgrade, rollback • DevOps related: continuous integration/deploy/delivery
  • 10. Operation as Process • Offline: treat an operation as a process – Process discovered automatically from logs/scripts • Clustering of log lines and process mining – Expected step outcomes specified as assertions • Online: use process context – Process context: process/instance/step ids, expected states – Errors are detected by examining logs and monitoring data • Assertion evaluation using monitoring facilities or directly • Compliance checking against expected processes using logs – Detected errors are further diagnosed for (root) causes • Examining a fault tree to locate potential root causes • Performing more diagnostic tests and on-demand assertions X. Xu, L. Zhu, et al. “POD-Diagnosis: Error Diagnosis of Sporadic Operations on Cloud Applications,” 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2014.
  • 11. Example: Rolling Upgrade Using Asgard [Diagram labels as extracted: Operator; Asgard; Log data; Process Mining Service; Discovered Model (Check AZs, Create Snapshot, Create instance from snapshot, Create AMI from instance, Evaluate AMI); relations: Read by, Controls, Outputs, Generates; phases: Offline, Online]
  • 12. POD-Detection: Error Detection • The Error Detection Service has two methods for detecting errors: – Assertion Checking – Conformance Checking
  • 13. Assertion Checking: how it works Log line: Assertions:
  • 14. Assertion Checking: how it works Log line: • Remove ... Assertions: • i has been de-registered from ELB • i has been removed from ASG • there is 1 less instance of v1
  • 15. Assertion Checking: how it works Log line: • Remove ... • Terminate ... Assertions: • i successfully terminated
  • 16. Assertion Checking: how it works Log line: • Remove ... • Terminate ... • Wait ... Assertions: • Next log line should appear within 17m35s (95th percentile)
  • 17. Assertion Checking: how it works Log line: • Remove ... • Terminate ... • Wait ... • New instance ... Assertions: • i' successfully launched
  • 18. Conformance Checking: how it works Log lines:
  • 19. Conformance Checking: how it works Log lines: • Remove ...
  • 20. Conformance Checking: how it works Log lines: • Remove ... • Terminate ...
  • 21. Conformance Checking: how it works Log lines: • Remove ... • Terminate ... • Wait ...
  • 22. Conformance Checking: how it works Log lines: • Remove ... • Terminate ... • Wait ... • Terminate ...???
  • 23. POD-Diagnosis: how it works • Fault trees (FT) are built as a knowledge base • On-demand diagnosis tests locate the (root) causes • Process context is used for FT pruning
  • 24. Evaluation: POD-Detection/Diagnosis • Experiments – Rolling upgrade of 100+ node cluster in AWS • Fault injection + confounding processes: random kill, scaling-in, ... • Detected errors – Assertion checking: known errors and global errors • Examples: key management, launch configuration, images – Compliance checking: unknown errors • skipped activities or undone activities • Timing and precision – Compared with Asgard/monitoring internal mechanisms • Detected more errors earlier – Diagnosis: limited to known causes in FT • 95th percentile of diagnosis time less than 4s; accuracy ranges from 80% to 100%
  • 25. Evaluation: POD-Detection/Diagnosis
  • 26. Our Approach • Incorporating change-related knowledge into system management – Sporadic operation knowledge • Process-Oriented Dependability: error detection and diagnosis under continuous change • Alerting management using process context • Availability analysis for sporadic operations – External event knowledge • Event-aware workload prediction
  • 27. Alerting Management using Process Context • Do not turn off alerts during sporadic operation • Dynamically suppress and annotate alerts using sporadic operation knowledge – CPU sensitive? – Network sensitive? – I/O sensitive? – Health checking sensitive? • Benefits – Reduce false positives of alerts – Add context to system monitoring data for later capacity planning and performance tuning
  • 28. Availability Analysis for Sporadic Operation • Sporadic Operation's Impact on Availability – Using Stochastic Reward Network (SRN) – Maintenance/Backup/Recovery operation • Architecture has an effect as well Qinghua Lu, et al. “Incorporating uncertainty into in-cloud application deployment decisions for availability”, IEEE 6th International Conference on Cloud Computing, June 2013
  • 29. [Slide with no extracted text]
  • 30. Availability Estimation for Different Deployment and Recovery Approaches
  • 31. Event-Aware Workload Prediction [Diagram labels: Upcoming Event Repository; Event Workload Model; Predict Workload; Workload Prediction] Matthew Sladescu, et al. “Event aware workload prediction: A study using auction events”, International Conference on Web Information Systems Engineering (WISE), 2012
  • 32. Predicting Workload [Chart: per-auction workload curves summed (+ + =) into the predicted workload; axes: Time (min) vs. Bids/min; marker: Time to Predict]
  • 33. Summary • System is undergoing continuous changes – Continuous deployment + Cloud uncertainty/visibility • Use change-related knowledge in system management – Sporadic operation knowledge • POD: error detection and diagnosis under continuous change • Alerting management using process context • Availability analysis for sporadic operations – External event knowledge • Event-aware workload prediction • We need industry help and collaboration – Logs, trials, case study and feedback Book: http://www.ssrg.nicta.com.au/projects/devops_book/ Contact: {firstname.lastname@nicta.com.au}
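
The conformance checking walk-through (slides 18 to 22) and the timing assertion on slide 16 can be illustrated with a small sketch. This is not the POD implementation: the process model below, the prefix-based log classification, and the sample instance ids are assumptions made for illustration. Only the step names and the 17m35s limit come from the slides.

```python
from datetime import timedelta

# Hypothetical process model for one rolling-upgrade loop: each step maps to
# the set of steps allowed to follow it. Step names follow the slide wording.
ALLOWED_NEXT = {
    "START": {"Remove"},
    "Remove": {"Terminate"},
    "Terminate": {"Wait"},
    "Wait": {"New instance"},
    "New instance": {"Remove", "END"},
}

# Timing assertion from slide 16: after "Wait", the next log line should
# appear within 17m35s (the 95th percentile quoted in the talk).
MAX_GAP_AFTER = {"Wait": timedelta(minutes=17, seconds=35)}


def classify(line):
    """Map a raw log line to a step name by prefix (assumed log format)."""
    for step in ("New instance", "Remove", "Terminate", "Wait"):
        if line.startswith(step):
            return step
    return None


def check_conformance(events):
    """Check (seconds_since_start, log_line) pairs, given in log order.

    Returns a list of error strings; an empty list means the trace conforms.
    """
    errors = []
    prev_step, prev_time = "START", None
    for t, line in events:
        step = classify(line)
        if step is None:
            errors.append(f"unknown log line at {t}s: {line!r}")
            continue
        # Conformance: is this step allowed after the previous one?
        if step not in ALLOWED_NEXT[prev_step]:
            errors.append(f"unexpected '{step}' after '{prev_step}' at {t}s")
        # Timing assertion: did this line arrive soon enough?
        limit = MAX_GAP_AFTER.get(prev_step)
        if limit is not None and prev_time is not None:
            if timedelta(seconds=t - prev_time) > limit:
                errors.append(f"'{step}' arrived too late after '{prev_step}' at {t}s")
        prev_step, prev_time = step, t
    return errors


# Trace from slide 22: a second "Terminate" appears where "New instance" is expected.
bad_trace = [(0, "Remove i-1"), (5, "Terminate i-1"), (20, "Wait for ASG"), (30, "Terminate i-3")]
print(check_conformance(bad_trace))
```

A production checker would presumably use the process model mined from logs rather than a hand-written transition table, and would raise the timing error when the deadline passes instead of waiting for the next log line.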

Editor's Notes

  1. From a high-level point of view, we take the time at which we want to predict the workload and identify all auctions that are active at that time. Using the workload model associated with each auction, we work out what percentage of its bids is expected at that time; then, based on past history, we work out how many bids are expected for each auction. The sum of these expected bids gives the prediction.
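
The prediction step described in this note can be sketched roughly as follows. The `Auction` structure, the piecewise-constant bid profile, and the sample numbers are all assumptions for illustration; the note does not specify what form the workload model takes in the cited study.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Auction:
    start_min: int               # auction start time, in minutes
    end_min: int                 # auction end time (exclusive), in minutes
    expected_total_bids: float   # assumed to be estimated from past history
    profile: List[float]         # assumed workload model: fraction of all bids
                                 # falling in each equal-width slice (sums to 1)


def bids_per_min(auction, t):
    """Expected bids per minute contributed by one auction at time t."""
    if not (auction.start_min <= t < auction.end_min):
        return 0.0  # auction is not active at time t
    duration = auction.end_min - auction.start_min
    slice_len = duration / len(auction.profile)
    idx = min(int((t - auction.start_min) / slice_len), len(auction.profile) - 1)
    # Spread the slice's share of bids evenly over the minutes in that slice.
    return auction.expected_total_bids * auction.profile[idx] / slice_len


def predict_workload(auctions, t):
    """Sum the expected bids/min of every auction active at time t."""
    return sum(bids_per_min(a, t) for a in auctions)


# Two hypothetical overlapping auctions; bidding in the first is end-loaded.
auctions = [
    Auction(0, 100, 200.0, [0.1, 0.2, 0.3, 0.4]),
    Auction(50, 150, 100.0, [0.5, 0.5]),
]
print(predict_workload(auctions, 90))
```

The summation in `predict_workload` mirrors the "+ + =" chart on slide 32: each active auction contributes its own curve, and the prediction at a given time is the sum of those curves at that time.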