SlideShare a Scribd company logo
1 of 12
NICTA Copyright 2012 From imagination to impact
Challenges in Practicing
High Frequency Releases in
Cloud Environments
Liming Zhu, Donna Xu, Xiwei Xu, An Binh
Tran, Ingo Weber, Len Bass
NICTA/UNSW
http://slideshare.net/limingzhu
NICTA Copyright 2012 From imagination to impact
NICTA (National ICT Australia)
• Australia’s National Centre of Excellence in
Information and Communication Technology
• Five Research Labs:
– ATP: Australian Technology Park, Sydney
– NRL: UNSW, Sydney
– CRL: ANU, Canberra
– VRL: Uni. Melbourne
– QRL: Uni. Queensland and QUT
• 700 staff including 270 PhD students
• Research Groups
– Software Systems Research Group (SSRG)
• ssrg.nicta.com.au
– Machine Learning, Optimisation, Networks,
Computer Vision
NICTA Copyright 2012 From imagination to impact
Challenge: High Frequency Releases/Changes
• Significant shorter release cycles and DevOps
– Continuous delivery/deployment
• from months at scheduled downtime to hours at all times
• Cloud uncertainty during provision/deployment
– Heavy reliance on Cloud APIs; Indirect control
– Other “sporadic” operations: cron jobs/backup/reconfig...
– Our focus: error detection/diagnosis during
continuous “changes”
• Anomaly-detection/monitoring for normal operation not working
• One solution: machine image as build artifacts?
– Heavily-baked vs. lightly-baked? Immutable server?
NICTA Copyright 2012 From imagination to impact
Heavily-Baked vs. Lightly-Baked
• Heavily-baked approach
+ No server drifts, consistent, more reliable?
– Image preparation time for any minor release
– Image sprawl
– Image consistency among teams
• coordination, golden image, image inheritance..
• Lightly-baked approach
+ Highly dynamic, config-as-service, less restarting…
– Less reliable due to runtime dependence on external
services (etc. repo, configuration services.. )?
– Drifting, outcome validation, race conditions..
NICTA Copyright 2012 From imagination to impact
Motivating Example: Rolling Upgrade
• Used in large-scale web operations
– Have 100+ servers in cloud with version 1 software
– Upgrade 10 servers at a time to version 2 software
• Potentially take a long time to complete with errors
during the operation
– Provisioning failure, logical failures, instance failure
– Other interfering operations
• Heavily-baked vs. lightly-baked
– Past experiences: Netflix Asgard with heavily-baked
– AWS OpsWorks:
• DevOps automation + life cycle events + abstraction
• Heavily-baked + built-in recipe vs. lightly-baked + custom recipe
NICTA Copyright 2012 From imagination to impact
Observations 1/3
NICTA Copyright 2012 From imagination to impact
Observations 2/3
NICTA Copyright 2012 From imagination to impact
Observations 3/3
NICTA Copyright 2012 From imagination to impact
Solutions for Better Reliability/Predictability
• Ad hoc tactics to reduce tails
– Inspired by Jeff Dean’s “Tail at Scale” CACM article
– Retry with alternative options
• stop-restart, replace, deploy without restart
– Fail fast
• Tracking status time and 95 percentile to fail fast
– Asynchronous waves for upgrading granularity >1
• Validate intermediary outcomes
– Inside machine:
• Chef Mini-test; test cases in production monitoring
– Outside machine:
• Process-Oriented Dependability (POD)
• Assertion checking and conformance checking
NICTA Copyright 2012 From imagination to impact
Process-Oriented Dependability (POD)
• Offline: treat operations as a processes
– Process discovered automatically from logs/scripts
• Log line clustering and process mining
– Expected step outcomes specified as assertions
• Online: use process context
– Process context: process/instance/step ids, expected states
– Errors are detected by examining logs and monitoring data
• Assertions evaluation using monitoring facilities or directly
• Compliance checking against expected processes
– Detected errors are further diagnosed for (root) causes
• Examining a fault tree to locate potential root causes
• Performing more diagnostic tests and on-demand assertions
X. Xu, L Zhu, et. al. "POD-Diagnosis: Error Diagnosis of Sporadic Operations on Cloud Applications,” 44nd Annual
IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2014.
NICTA Copyright 2012 From imagination to impact
Example: Rolling Upgrade Using Asgard
Read by
Operator
Process
Mining
Service
Controls
Outputs Create SnapshotCheck AZs
Create instance
from snapshot
Create AMI from
instance
Evaluate AMI
Discovered
Model
Asgard Log dataLog dataGenerates
Offline
Online
Error Detection Service has two
methods for detecting errors:
• Assertion Checking
• Conformance Checking
NICTA Copyright 2012 From imagination to impact
Summary
• Lightly vs. heavily-baked for high frequency releases
• Solutions for unreliable processes
– Some tactics to reduce long tails
• fail fast, alternative actions, asynchronous waves…
– Validate intermediary outcomes
• Inside machine: Chef Mini-test; test cases in production monitoring
• Outside machine: Process-Oriented Dependability (POD)
– Assertion checking and conformance checking
• Currently integrating with monitoring and alerting
• We need industry help and collaboration
– Logs, trials, feedback, case study as book chapters
Book: http://www.ssrg.nicta.com.au/projects/devops_book/
Contact: Liming.Zhu@nicta.com.au

More Related Content

What's hot

Group Health Cooperative Customer Presentation
Group Health Cooperative Customer PresentationGroup Health Cooperative Customer Presentation
Group Health Cooperative Customer PresentationSplunk
 
Customer Presentation - Cerner
Customer Presentation - CernerCustomer Presentation - Cerner
Customer Presentation - CernerSplunk
 
AdvancedMD Customer Presentation
AdvancedMD Customer PresentationAdvancedMD Customer Presentation
AdvancedMD Customer PresentationSplunk
 
Splunk live university of alberta 2015
Splunk live university of alberta 2015Splunk live university of alberta 2015
Splunk live university of alberta 2015dostatni
 
Scaling Pinterest's Monitoring
Scaling Pinterest's MonitoringScaling Pinterest's Monitoring
Scaling Pinterest's MonitoringBrian Overstreet
 
Cisco and Splunk: Under the Hood of Cisco IT Breakout Session
Cisco and Splunk: Under the Hood of Cisco IT Breakout SessionCisco and Splunk: Under the Hood of Cisco IT Breakout Session
Cisco and Splunk: Under the Hood of Cisco IT Breakout SessionSplunk
 
Facilitating DevOps Execution in an All Digital Environment
Facilitating DevOps Execution in an All Digital EnvironmentFacilitating DevOps Execution in an All Digital Environment
Facilitating DevOps Execution in an All Digital EnvironmentKurt Andersen
 
SplunkLive! Customer Presentation - Cisco Systems, Inc.
SplunkLive! Customer Presentation - Cisco Systems, Inc.SplunkLive! Customer Presentation - Cisco Systems, Inc.
SplunkLive! Customer Presentation - Cisco Systems, Inc.Splunk
 
University of Alberta Customer Presentation
University of Alberta Customer PresentationUniversity of Alberta Customer Presentation
University of Alberta Customer PresentationSplunk
 
SplunkLive! Austin Customer Presentation - Xerox
SplunkLive! Austin Customer Presentation - XeroxSplunkLive! Austin Customer Presentation - Xerox
SplunkLive! Austin Customer Presentation - XeroxSplunk
 
History of data centric transformation preso
History of data centric transformation presoHistory of data centric transformation preso
History of data centric transformation presoDarren Pulsipher
 
IEEE Buenaventura cs Chapter March 9 2016 v4
IEEE Buenaventura cs Chapter March 9 2016  v4IEEE Buenaventura cs Chapter March 9 2016  v4
IEEE Buenaventura cs Chapter March 9 2016 v4Sailaja Tennati
 
WestJet Customer Presentation
WestJet Customer PresentationWestJet Customer Presentation
WestJet Customer PresentationSplunk
 
Spirent Accelerating SDN and NFV Deployments
Spirent Accelerating SDN and NFV DeploymentsSpirent Accelerating SDN and NFV Deployments
Spirent Accelerating SDN and NFV DeploymentsSailaja Tennati
 
SplunkLive! Austin Customer Presentation - Baylor
SplunkLive! Austin Customer Presentation - BaylorSplunkLive! Austin Customer Presentation - Baylor
SplunkLive! Austin Customer Presentation - BaylorSplunk
 
Yi Wang, Tech Lead of AI Platform, Baidu, at MLconf 2017
Yi Wang, Tech Lead of AI Platform, Baidu, at MLconf 2017Yi Wang, Tech Lead of AI Platform, Baidu, at MLconf 2017
Yi Wang, Tech Lead of AI Platform, Baidu, at MLconf 2017MLconf
 
Mistral and StackStorm
Mistral and StackStormMistral and StackStorm
Mistral and StackStormDmitri Zimine
 
StackStorm Product Highlights - DevOps Enterprise 2014 After-Party Ignite Talk
StackStorm Product Highlights - DevOps Enterprise 2014 After-Party Ignite TalkStackStorm Product Highlights - DevOps Enterprise 2014 After-Party Ignite Talk
StackStorm Product Highlights - DevOps Enterprise 2014 After-Party Ignite TalkStackStorm
 

What's hot (19)

Group Health Cooperative Customer Presentation
Group Health Cooperative Customer PresentationGroup Health Cooperative Customer Presentation
Group Health Cooperative Customer Presentation
 
Customer Presentation - Cerner
Customer Presentation - CernerCustomer Presentation - Cerner
Customer Presentation - Cerner
 
Chaudhry_Gap Analysis_IEEE_HQs
Chaudhry_Gap Analysis_IEEE_HQsChaudhry_Gap Analysis_IEEE_HQs
Chaudhry_Gap Analysis_IEEE_HQs
 
AdvancedMD Customer Presentation
AdvancedMD Customer PresentationAdvancedMD Customer Presentation
AdvancedMD Customer Presentation
 
Splunk live university of alberta 2015
Splunk live university of alberta 2015Splunk live university of alberta 2015
Splunk live university of alberta 2015
 
Scaling Pinterest's Monitoring
Scaling Pinterest's MonitoringScaling Pinterest's Monitoring
Scaling Pinterest's Monitoring
 
Cisco and Splunk: Under the Hood of Cisco IT Breakout Session
Cisco and Splunk: Under the Hood of Cisco IT Breakout SessionCisco and Splunk: Under the Hood of Cisco IT Breakout Session
Cisco and Splunk: Under the Hood of Cisco IT Breakout Session
 
Facilitating DevOps Execution in an All Digital Environment
Facilitating DevOps Execution in an All Digital EnvironmentFacilitating DevOps Execution in an All Digital Environment
Facilitating DevOps Execution in an All Digital Environment
 
SplunkLive! Customer Presentation - Cisco Systems, Inc.
SplunkLive! Customer Presentation - Cisco Systems, Inc.SplunkLive! Customer Presentation - Cisco Systems, Inc.
SplunkLive! Customer Presentation - Cisco Systems, Inc.
 
University of Alberta Customer Presentation
University of Alberta Customer PresentationUniversity of Alberta Customer Presentation
University of Alberta Customer Presentation
 
SplunkLive! Austin Customer Presentation - Xerox
SplunkLive! Austin Customer Presentation - XeroxSplunkLive! Austin Customer Presentation - Xerox
SplunkLive! Austin Customer Presentation - Xerox
 
History of data centric transformation preso
History of data centric transformation presoHistory of data centric transformation preso
History of data centric transformation preso
 
IEEE Buenaventura cs Chapter March 9 2016 v4
IEEE Buenaventura cs Chapter March 9 2016  v4IEEE Buenaventura cs Chapter March 9 2016  v4
IEEE Buenaventura cs Chapter March 9 2016 v4
 
WestJet Customer Presentation
WestJet Customer PresentationWestJet Customer Presentation
WestJet Customer Presentation
 
Spirent Accelerating SDN and NFV Deployments
Spirent Accelerating SDN and NFV DeploymentsSpirent Accelerating SDN and NFV Deployments
Spirent Accelerating SDN and NFV Deployments
 
SplunkLive! Austin Customer Presentation - Baylor
SplunkLive! Austin Customer Presentation - BaylorSplunkLive! Austin Customer Presentation - Baylor
SplunkLive! Austin Customer Presentation - Baylor
 
Yi Wang, Tech Lead of AI Platform, Baidu, at MLconf 2017
Yi Wang, Tech Lead of AI Platform, Baidu, at MLconf 2017Yi Wang, Tech Lead of AI Platform, Baidu, at MLconf 2017
Yi Wang, Tech Lead of AI Platform, Baidu, at MLconf 2017
 
Mistral and StackStorm
Mistral and StackStormMistral and StackStorm
Mistral and StackStorm
 
StackStorm Product Highlights - DevOps Enterprise 2014 After-Party Ignite Talk
StackStorm Product Highlights - DevOps Enterprise 2014 After-Party Ignite TalkStackStorm Product Highlights - DevOps Enterprise 2014 After-Party Ignite Talk
StackStorm Product Highlights - DevOps Enterprise 2014 After-Party Ignite Talk
 

Similar to Challenges in Practicing High Frequency Releases in Cloud Environments

Dependable Operation - Performance Management and Capacity Planning Under Con...
Dependable Operation - Performance Management and Capacity Planning Under Con...Dependable Operation - Performance Management and Capacity Planning Under Con...
Dependable Operation - Performance Management and Capacity Planning Under Con...Liming Zhu
 
Using Machine Learning to Understand Kafka Runtime Behavior (Shivanath Babu, ...
Using Machine Learning to Understand Kafka Runtime Behavior (Shivanath Babu, ...Using Machine Learning to Understand Kafka Runtime Behavior (Shivanath Babu, ...
Using Machine Learning to Understand Kafka Runtime Behavior (Shivanath Babu, ...confluent
 
On the Application of AI for Failure Management: Problems, Solutions and Algo...
On the Application of AI for Failure Management: Problems, Solutions and Algo...On the Application of AI for Failure Management: Problems, Solutions and Algo...
On the Application of AI for Failure Management: Problems, Solutions and Algo...Jorge Cardoso
 
Delivering Operational Intelligence at NAB with Splunk, Gartner Symposium ITX...
Delivering Operational Intelligence at NAB with Splunk, Gartner Symposium ITX...Delivering Operational Intelligence at NAB with Splunk, Gartner Symposium ITX...
Delivering Operational Intelligence at NAB with Splunk, Gartner Symposium ITX...Splunk
 
API Training 10 Nov 2014
API Training 10 Nov 2014API Training 10 Nov 2014
API Training 10 Nov 2014Digital Bond
 
SplunkLive! London - Splunk App for Stream & MINT Breakout
SplunkLive! London - Splunk App for Stream & MINT BreakoutSplunkLive! London - Splunk App for Stream & MINT Breakout
SplunkLive! London - Splunk App for Stream & MINT BreakoutSplunk
 
Automatic Undo for Cloud Management via AI Planning
Automatic Undo for Cloud Management via AI PlanningAutomatic Undo for Cloud Management via AI Planning
Automatic Undo for Cloud Management via AI PlanningHiroshi Wada
 
Network Automation Journey, A systems engineer NetOps perspective
Network Automation Journey, A systems engineer NetOps perspectiveNetwork Automation Journey, A systems engineer NetOps perspective
Network Automation Journey, A systems engineer NetOps perspectiveWalid Shaari
 
Webinar: Improve Splunk Analytics and Automate Processes with SnapLogic
Webinar: Improve Splunk Analytics and Automate Processes with SnapLogicWebinar: Improve Splunk Analytics and Automate Processes with SnapLogic
Webinar: Improve Splunk Analytics and Automate Processes with SnapLogicSnapLogic
 
Error in hadoop
Error in hadoopError in hadoop
Error in hadoopLen Bass
 
Mtc learnings from isv & enterprise (dated - Dec -2014)
Mtc learnings from isv & enterprise (dated - Dec -2014)Mtc learnings from isv & enterprise (dated - Dec -2014)
Mtc learnings from isv & enterprise (dated - Dec -2014)Govind Kanshi
 
Mtc learnings from isv & enterprise interaction
Mtc learnings from isv & enterprise  interactionMtc learnings from isv & enterprise  interaction
Mtc learnings from isv & enterprise interactionGovind Kanshi
 
Opal: Simple Web Services Wrappers for Scientific Applications
Opal: Simple Web Services Wrappers for Scientific ApplicationsOpal: Simple Web Services Wrappers for Scientific Applications
Opal: Simple Web Services Wrappers for Scientific ApplicationsSriram Krishnan
 
Proactive ops for container orchestration environments
Proactive ops for container orchestration environmentsProactive ops for container orchestration environments
Proactive ops for container orchestration environmentsDocker, Inc.
 
network-management Web base.ppt
network-management Web base.pptnetwork-management Web base.ppt
network-management Web base.pptAssadLeo1
 
OpenStack Enabling DevOps
OpenStack Enabling DevOpsOpenStack Enabling DevOps
OpenStack Enabling DevOpsCisco DevNet
 
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your Mind
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your MindDeliver Best-in-Class HPC Cloud Solutions Without Losing Your Mind
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your MindAvere Systems
 
OpenNebulaConf2015 1.07 Cloud for Scientific Computing @ STFC - Alexander Dibbo
OpenNebulaConf2015 1.07 Cloud for Scientific Computing @ STFC - Alexander DibboOpenNebulaConf2015 1.07 Cloud for Scientific Computing @ STFC - Alexander Dibbo
OpenNebulaConf2015 1.07 Cloud for Scientific Computing @ STFC - Alexander DibboOpenNebula Project
 

Similar to Challenges in Practicing High Frequency Releases in Cloud Environments (20)

Dependable Operation - Performance Management and Capacity Planning Under Con...
Dependable Operation - Performance Management and Capacity Planning Under Con...Dependable Operation - Performance Management and Capacity Planning Under Con...
Dependable Operation - Performance Management and Capacity Planning Under Con...
 
Using Machine Learning to Understand Kafka Runtime Behavior (Shivanath Babu, ...
Using Machine Learning to Understand Kafka Runtime Behavior (Shivanath Babu, ...Using Machine Learning to Understand Kafka Runtime Behavior (Shivanath Babu, ...
Using Machine Learning to Understand Kafka Runtime Behavior (Shivanath Babu, ...
 
Planning open stack-poc
Planning open stack-pocPlanning open stack-poc
Planning open stack-poc
 
On the Application of AI for Failure Management: Problems, Solutions and Algo...
On the Application of AI for Failure Management: Problems, Solutions and Algo...On the Application of AI for Failure Management: Problems, Solutions and Algo...
On the Application of AI for Failure Management: Problems, Solutions and Algo...
 
Delivering Operational Intelligence at NAB with Splunk, Gartner Symposium ITX...
Delivering Operational Intelligence at NAB with Splunk, Gartner Symposium ITX...Delivering Operational Intelligence at NAB with Splunk, Gartner Symposium ITX...
Delivering Operational Intelligence at NAB with Splunk, Gartner Symposium ITX...
 
API Training 10 Nov 2014
API Training 10 Nov 2014API Training 10 Nov 2014
API Training 10 Nov 2014
 
SplunkLive! London - Splunk App for Stream & MINT Breakout
SplunkLive! London - Splunk App for Stream & MINT BreakoutSplunkLive! London - Splunk App for Stream & MINT Breakout
SplunkLive! London - Splunk App for Stream & MINT Breakout
 
Automatic Undo for Cloud Management via AI Planning
Automatic Undo for Cloud Management via AI PlanningAutomatic Undo for Cloud Management via AI Planning
Automatic Undo for Cloud Management via AI Planning
 
Network Automation Journey, A systems engineer NetOps perspective
Network Automation Journey, A systems engineer NetOps perspectiveNetwork Automation Journey, A systems engineer NetOps perspective
Network Automation Journey, A systems engineer NetOps perspective
 
Webinar: Improve Splunk Analytics and Automate Processes with SnapLogic
Webinar: Improve Splunk Analytics and Automate Processes with SnapLogicWebinar: Improve Splunk Analytics and Automate Processes with SnapLogic
Webinar: Improve Splunk Analytics and Automate Processes with SnapLogic
 
Error in hadoop
Error in hadoopError in hadoop
Error in hadoop
 
Mtc learnings from isv & enterprise (dated - Dec -2014)
Mtc learnings from isv & enterprise (dated - Dec -2014)Mtc learnings from isv & enterprise (dated - Dec -2014)
Mtc learnings from isv & enterprise (dated - Dec -2014)
 
Mtc learnings from isv & enterprise interaction
Mtc learnings from isv & enterprise  interactionMtc learnings from isv & enterprise  interaction
Mtc learnings from isv & enterprise interaction
 
Opal: Simple Web Services Wrappers for Scientific Applications
Opal: Simple Web Services Wrappers for Scientific ApplicationsOpal: Simple Web Services Wrappers for Scientific Applications
Opal: Simple Web Services Wrappers for Scientific Applications
 
Proactive ops for container orchestration environments
Proactive ops for container orchestration environmentsProactive ops for container orchestration environments
Proactive ops for container orchestration environments
 
network-management Web base.ppt
network-management Web base.pptnetwork-management Web base.ppt
network-management Web base.ppt
 
OpenStack Enabling DevOps
OpenStack Enabling DevOpsOpenStack Enabling DevOps
OpenStack Enabling DevOps
 
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your Mind
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your MindDeliver Best-in-Class HPC Cloud Solutions Without Losing Your Mind
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your Mind
 
OpenNebulaConf2015 1.07 Cloud for Scientific Computing @ STFC - Alexander Dibbo
OpenNebulaConf2015 1.07 Cloud for Scientific Computing @ STFC - Alexander DibboOpenNebulaConf2015 1.07 Cloud for Scientific Computing @ STFC - Alexander Dibbo
OpenNebulaConf2015 1.07 Cloud for Scientific Computing @ STFC - Alexander Dibbo
 
What is DevOps?
What is DevOps?What is DevOps?
What is DevOps?
 

More from Liming Zhu

AI Transformation A Clash with Human Expertise
AI TransformationA Clash with Human ExpertiseAI TransformationA Clash with Human Expertise
AI Transformation A Clash with Human ExpertiseLiming Zhu
 
Deciphering AI: Human Expertise in the Age of Evolving AI
Deciphering AI: Human Expertise in the Age of Evolving AIDeciphering AI: Human Expertise in the Age of Evolving AI
Deciphering AI: Human Expertise in the Age of Evolving AILiming Zhu
 
GenAI in Research with Responsible AI
GenAI in Researchwith Responsible AIGenAI in Researchwith Responsible AI
GenAI in Research with Responsible AILiming Zhu
 
AI Unveiled: From Current State to Future Frontiers
AI Unveiled: From Current State to Future FrontiersAI Unveiled: From Current State to Future Frontiers
AI Unveiled: From Current State to Future FrontiersLiming Zhu
 
Software Architecture for Foundation Model-Based Systems
Software Architecture for Foundation Model-Based SystemsSoftware Architecture for Foundation Model-Based Systems
Software Architecture for Foundation Model-Based SystemsLiming Zhu
 
AI Transformation
AI TransformationAI Transformation
AI TransformationLiming Zhu
 
Generative-AI-in-enterprise-20230615.pdf
Generative-AI-in-enterprise-20230615.pdfGenerative-AI-in-enterprise-20230615.pdf
Generative-AI-in-enterprise-20230615.pdfLiming Zhu
 
Trends & Innovation in Cyber and Digitaltech
Trends & Innovationin Cyber and DigitaltechTrends & Innovationin Cyber and Digitaltech
Trends & Innovation in Cyber and DigitaltechLiming Zhu
 
Responsible/Trustworthy AI in the Era of Foundation Models
Responsible/Trustworthy AI in the Era of Foundation Models Responsible/Trustworthy AI in the Era of Foundation Models
Responsible/Trustworthy AI in the Era of Foundation Models Liming Zhu
 
ICSE23 Keynote: Software Engineering as the Linchpin of Responsible AI
ICSE23 Keynote: Software Engineering as the Linchpin of Responsible AIICSE23 Keynote: Software Engineering as the Linchpin of Responsible AI
ICSE23 Keynote: Software Engineering as the Linchpin of Responsible AILiming Zhu
 
International Cooperation for Research on Privacy and Data Protection - Austr...
International Cooperation for Research on Privacy and Data Protection - Austr...International Cooperation for Research on Privacy and Data Protection - Austr...
International Cooperation for Research on Privacy and Data Protection - Austr...Liming Zhu
 
RegTech for IR - Opportunities and Lessons
RegTech for IR - Opportunities and LessonsRegTech for IR - Opportunities and Lessons
RegTech for IR - Opportunities and LessonsLiming Zhu
 
Emerging Technologies in Data Sharing and Analytics at Data61
Emerging Technologies in Data Sharing and Analytics at Data61Emerging Technologies in Data Sharing and Analytics at Data61
Emerging Technologies in Data Sharing and Analytics at Data61Liming Zhu
 
Responsible AI The Australian Approach
Responsible AIThe Australian ApproachResponsible AIThe Australian Approach
Responsible AI The Australian ApproachLiming Zhu
 
Distributed Trust Architecture: The New Reality of ML-based Systems
Distributed Trust Architecture: The New Reality of ML-based SystemsDistributed Trust Architecture: The New Reality of ML-based Systems
Distributed Trust Architecture: The New Reality of ML-based SystemsLiming Zhu
 
Distributed Trust Architecture: The New Foundation of Everything
Distributed Trust Architecture: The New Foundation of EverythingDistributed Trust Architecture: The New Foundation of Everything
Distributed Trust Architecture: The New Foundation of EverythingLiming Zhu
 
Cyber technologies for SME growth – Barriers and Solutions
Cyber technologies for SME growth – Barriers and SolutionsCyber technologies for SME growth – Barriers and Solutions
Cyber technologies for SME growth – Barriers and SolutionsLiming Zhu
 
Emerging Technologies in Synthetic Representation and Digital Twin
Emerging Technologies in Synthetic Representation and Digital TwinEmerging Technologies in Synthetic Representation and Digital Twin
Emerging Technologies in Synthetic Representation and Digital TwinLiming Zhu
 
Responsible AI & Cybersecurity: A tale of two technology risks
Responsible AI & Cybersecurity: A tale of two technology risksResponsible AI & Cybersecurity: A tale of two technology risks
Responsible AI & Cybersecurity: A tale of two technology risksLiming Zhu
 
Dependable Operations
Dependable OperationsDependable Operations
Dependable OperationsLiming Zhu
 

More from Liming Zhu (20)

AI Transformation A Clash with Human Expertise
AI TransformationA Clash with Human ExpertiseAI TransformationA Clash with Human Expertise
AI Transformation A Clash with Human Expertise
 
Deciphering AI: Human Expertise in the Age of Evolving AI
Deciphering AI: Human Expertise in the Age of Evolving AIDeciphering AI: Human Expertise in the Age of Evolving AI
Deciphering AI: Human Expertise in the Age of Evolving AI
 
GenAI in Research with Responsible AI
GenAI in Researchwith Responsible AIGenAI in Researchwith Responsible AI
GenAI in Research with Responsible AI
 
AI Unveiled: From Current State to Future Frontiers
AI Unveiled: From Current State to Future FrontiersAI Unveiled: From Current State to Future Frontiers
AI Unveiled: From Current State to Future Frontiers
 
Software Architecture for Foundation Model-Based Systems
Software Architecture for Foundation Model-Based SystemsSoftware Architecture for Foundation Model-Based Systems
Software Architecture for Foundation Model-Based Systems
 
AI Transformation
AI TransformationAI Transformation
AI Transformation
 
Generative-AI-in-enterprise-20230615.pdf
Generative-AI-in-enterprise-20230615.pdfGenerative-AI-in-enterprise-20230615.pdf
Generative-AI-in-enterprise-20230615.pdf
 
Trends & Innovation in Cyber and Digitaltech
Trends & Innovationin Cyber and DigitaltechTrends & Innovationin Cyber and Digitaltech
Trends & Innovation in Cyber and Digitaltech
 
Responsible/Trustworthy AI in the Era of Foundation Models
Responsible/Trustworthy AI in the Era of Foundation Models Responsible/Trustworthy AI in the Era of Foundation Models
Responsible/Trustworthy AI in the Era of Foundation Models
 
ICSE23 Keynote: Software Engineering as the Linchpin of Responsible AI
ICSE23 Keynote: Software Engineering as the Linchpin of Responsible AIICSE23 Keynote: Software Engineering as the Linchpin of Responsible AI
ICSE23 Keynote: Software Engineering as the Linchpin of Responsible AI
 
International Cooperation for Research on Privacy and Data Protection - Austr...
International Cooperation for Research on Privacy and Data Protection - Austr...International Cooperation for Research on Privacy and Data Protection - Austr...
International Cooperation for Research on Privacy and Data Protection - Austr...
 
RegTech for IR - Opportunities and Lessons
RegTech for IR - Opportunities and LessonsRegTech for IR - Opportunities and Lessons
RegTech for IR - Opportunities and Lessons
 
Emerging Technologies in Data Sharing and Analytics at Data61
Emerging Technologies in Data Sharing and Analytics at Data61Emerging Technologies in Data Sharing and Analytics at Data61
Emerging Technologies in Data Sharing and Analytics at Data61
 
Responsible AI The Australian Approach
Responsible AIThe Australian ApproachResponsible AIThe Australian Approach
Responsible AI The Australian Approach
 
Distributed Trust Architecture: The New Reality of ML-based Systems
Distributed Trust Architecture: The New Reality of ML-based SystemsDistributed Trust Architecture: The New Reality of ML-based Systems
Distributed Trust Architecture: The New Reality of ML-based Systems
 
Distributed Trust Architecture: The New Foundation of Everything
Distributed Trust Architecture: The New Foundation of EverythingDistributed Trust Architecture: The New Foundation of Everything
Distributed Trust Architecture: The New Foundation of Everything
 
Cyber technologies for SME growth – Barriers and Solutions
Cyber technologies for SME growth – Barriers and SolutionsCyber technologies for SME growth – Barriers and Solutions
Cyber technologies for SME growth – Barriers and Solutions
 
Emerging Technologies in Synthetic Representation and Digital Twin
Emerging Technologies in Synthetic Representation and Digital TwinEmerging Technologies in Synthetic Representation and Digital Twin
Emerging Technologies in Synthetic Representation and Digital Twin
 
Responsible AI & Cybersecurity: A tale of two technology risks
Responsible AI & Cybersecurity: A tale of two technology risksResponsible AI & Cybersecurity: A tale of two technology risks
Responsible AI & Cybersecurity: A tale of two technology risks
 
Dependable Operations
Dependable OperationsDependable Operations
Dependable Operations
 

Recently uploaded

Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....kzayra69
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsAhmed Mohamed
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataBradBedford3
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)jennyeacort
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Velvetech LLC
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作qr0udbr0
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样umasea
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based projectAnoyGreter
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmSujith Sukumaran
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 

Recently uploaded (20)

Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 

Challenges in Practicing High Frequency Releases in Cloud Environments

  • 1. NICTA Copyright 2012 From imagination to impact Challenges in Practicing High Frequency Releases in Cloud Environments Liming Zhu, Donna Xu, Xiwei Xu, An Binh Tran, Ingo Weber, Len Bass NICTA/UNSW http://slideshare.net/limingzhu
  • 2. NICTA Copyright 2012 From imagination to impact NICTA (National ICT Australia) • Australia’s National Centre of Excellence in Information and Communication Technology • Five Research Labs: – ATP: Australian Technology Park, Sydney – NRL: UNSW, Sydney – CRL: ANU, Canberra – VRL: Uni. Melbourne – QRL: Uni. Queensland and QUT • 700 staff including 270 PhD students • Research Groups – Software Systems Research Group (SSRG) • ssrg.nicta.com.au – Machine Learning, Optimisation, Networks, Computer Vision
  • 3. NICTA Copyright 2012 From imagination to impact Challenge: High Frequency Releases/Changes • Significant shorter release cycles and DevOps – Continuous delivery/deployment • from months at scheduled downtime to hours at all times • Cloud uncertainty during provision/deployment – Heavy reliance on Cloud APIs; Indirect control – Other “sporadic” operations: cron jobs/backup/reconfig... – Our focus: error detection/diagnosis during continuous “changes” • Anomaly-detection/monitoring for normal operation not working • One solution: machine image as build artifacts? – Heavily-baked vs. lightly-baked? Immutable server?
  • 4. NICTA Copyright 2012 From imagination to impact Heavily-Baked vs. Lightly-Baked • Heavily-baked approach + No server drifts, consistent, more reliable? – Image preparation time for any minor release – Image sprawl – Image consistency among teams • coordination, golden image, image inheritance.. • Lightly-baked approach + Highly dynamic, config-as-service, less restarting… – Less reliable due to runtime dependence on external services (etc. repo, configuration services.. )? – Drifting, outcome validation, race conditions..
  • 5. NICTA Copyright 2012 From imagination to impact Motivating Example: Rolling Upgrade • Used in large-scale web operations – Have 100+ servers in cloud with version 1 software – Upgrade 10 servers at a time to version 2 software • Potentially take a long time to complete with errors during the operation – Provisioning failure, logical failures, instance failure – Other interfering operations • Heavily-baked vs. lightly-baked – Past experiences: Netflix Asgard with heavily-baked – AWS OpsWorks: • DevOps automation + life cycle events + abstraction • Heavily-baked + built-in recipe vs. lightly-baked + custom recipe
  • 6. NICTA Copyright 2012 From imagination to impact Observations 1/3
  • 7. NICTA Copyright 2012 From imagination to impact Observations 2/3
  • 8. NICTA Copyright 2012 From imagination to impact Observations 3/3
  • 9. NICTA Copyright 2012 From imagination to impact Solutions for Better Reliability/Predictability • Ad hoc tactics to reduce tails – Inspired by Jeff Dean’s “Tail at Scale” CACM article – Retry with alternative options • stop-restart, replace, deploy without restart – Fail fast • Tracking status time and 95 percentile to fail fast – Asynchronous waves for upgrading granularity >1 • Validate intermediary outcomes – Inside machine: • Chef Mini-test; test cases in production monitoring – Outside machine: • Process-Oriented Dependability (POD) • Assertion checking and conformance checking
  • 10. NICTA Copyright 2012 From imagination to impact Process-Oriented Dependability (POD) • Offline: treat operations as a processes – Process discovered automatically from logs/scripts • Log line clustering and process mining – Expected step outcomes specified as assertions • Online: use process context – Process context: process/instance/step ids, expected states – Errors are detected by examining logs and monitoring data • Assertions evaluation using monitoring facilities or directly • Compliance checking against expected processes – Detected errors are further diagnosed for (root) causes • Examining a fault tree to locate potential root causes • Performing more diagnostic tests and on-demand assertions X. Xu, L Zhu, et. al. "POD-Diagnosis: Error Diagnosis of Sporadic Operations on Cloud Applications,” 44nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2014.
  • 11. NICTA Copyright 2012 From imagination to impact Example: Rolling Upgrade Using Asgard Read by Operator Process Mining Service Controls Outputs Create SnapshotCheck AZs Create instance from snapshot Create AMI from instance Evaluate AMI Discovered Model Asgard Log dataLog dataGenerates Offline Online Error Detection Service has two methods for detecting errors: • Assertion Checking • Conformance Checking
  • 12. NICTA Copyright 2012 From imagination to impact Summary • Lightly vs. heavily-baked for high frequency releases • Solutions for unreliable processes – Some tactics to reduce long tails • fail fast, alternative actions, asynchronous waves… – Validate intermediary outcomes • Inside machine: Chef Mini-test; test cases in production monitoring • Outside machine: Process-Oriented Dependability (POD) – Assertion checking and conformance checking • Currently integrating with monitoring and alerting • We need industry help and collaboration – Logs, trials, feedback, case study as book chapters Book: http://www.ssrg.nicta.com.au/projects/devops_book/ Contact: Liming.Zhu@nicta.com.au