SlideShare a Scribd company logo
1 of 19
Download to read offline
NICTA Copyright 2012 From imagination to impact
Automatic Undo for
Cloud Management via
AI Planning
Ingo Weber, Hiroshi Wada, Alan Fekete,
Anna Liu and Len Bass
Based on presentation at USENIX HotDep ‘12
NICTA Copyright 2012 From imagination to impact
NICTA in Brief
• Australia’s National Centre of Excellence in
Information and Communication Technology
• Five Research Labs:
– ATP: Australian Technology Park, Sydney
– NRL: UNSW, Sydney
– CRL: ANU, Canberra
– VRL: Uni. Melbourne
– QRL: Uni. Queensland and QUT
• 700 staff including 270 PhD students
• Budget: ~$90M/yr from Fed/State Gov and
industry
• ~600 research papers/year, ~150 patents total
NICTA Copyright 2012 From imagination to impact
our spin-out
3
NICTA Copyright 2012 From imagination to impact
Motivation of this work
• Yuruware Bolt: site-replication across regions
– Executes long-running operations - create, delete and
update variety resources via AWS API
• Issues we face
– High cost of writing unit tests
• Preparing a test bed, reset after each test, and error recovery
• “Why there is no DBUnit for cloud!”
– High cost of error handling
• Resources often get stuck in an unexpected state
• Well-coordinated and tailored clean up for each case
4
NICTA Copyright 2012 From imagination to impact
Our Goal
• Provide an “undo” button to cloud users
– Allow for rolling back to a previous state
– e.g., undelete deleted resources and reconstruct the
relations among resources
• We’re users - cannot alter API or implementation
– Some operations are undoable, e.g., detach a VIP
– Some operations are semi-undoable, e.g., stop a VM
– Some operations are not undoable, e.g., delete a disk
• Less invasive
– Minimum changes in existing code or scripts
5
NICTA Copyright 2012 From imagination to impact
Status quo
6
dodo
Operations
(API calls)
Administrator/script
Cloud resources
API calls
• Administrators and/or scripts talk to cloud via API
NICTA Copyright 2012 From imagination to impact
Goal
7
Checkpoint
Rollback
dodo
Operations
(API calls)
Administrator/script
Commit
Cloud resources
API calls
• Provide the ability to go back to a checkpoint
• No change in scripts except checkpoint and commit
NICTA Copyright 2012 From imagination to impact
Our approach
• “undo one by one in reverse order“ does not
always work
– No undo action is available
• e.g., no “undeleteing a deleted resouce “
– Undo requires a different sequence of API calls
• e.g., “deleteing an autoscaling group“ does not work
– Simple undo could results in a different state
• e.g., Undo “stopping an instance with a VIP“
– Undo operation may fail
• e.g., detaching a volume could fail. Need an alternative.
– Not optimal
• e.g., Bolt‘s operations (creating many temporary resouces)
• Tedious to hand-code for all possible cases!
8
NICTA Copyright 2012 From imagination to impact
Pseudo-delete for not-undoable ops
9
Checkpoint
Rollback
dodo
Operations
(API calls)
Administrator/script
Commit
API Wrapper
Logical state
(e.g., pseudo-
delete flags)
Cloud resources
Wrapped
API calls
API calls
Apply
changes
Execute
deletes
• Execute API calls if they are undoable
• Defer the execution of non-undoable calls until commit
NICTA Copyright 2012 From imagination to impact
AI planning for the rest of cases
10
Undo System
Checkpoint
Rollback
dodo
Operations
(API calls)
Sense cloud
resources
states
Sense cloud
resources
states
AI Planner
Administrator/script
Resource
state
(PDDL)
Generate
Input as
goal state
Commit
API Wrapper
Logical state
(e.g., pseudo-
delete flags)
Cloud resources
Wrapped
API calls
API calls
Apply
changes
Execute
deletes
Resource
state
(PDDL)
Input as
initial state
Generate
Domain
model
(PDDL)
Input
Generate
Compen-
sation plan
Compen-
sation
script
Code
generator
Execute
rollback
• Changes made by
(semi-)undoable API
calls are compensated
by an AI planner
• AI planner finds ways
to handle errors
potentially occur during
undo as well
NICTA Copyright 2012 From imagination to impact
AI Planning 101
• Given the initial state of the world, goal state,
and a set of available actions, find a sequence of
actions that leads from the initial to the goal
• Highly optimized heuristics tame the problem for
practical purposes
11
http://en.wikipedia.org/wiki/Tower_of_Hanoi
NICTA Copyright 2012 From imagination to impact
Planning under uncertainty
• In our problem
– Initial state: state at rollback
– Goal state: state at checkpoint
– Actions: AWS APIs
• We use FF [*] with an extension to handle
actions with alternative outcomes
• Finds “maximal“ contingency plans
– e.g., if detachnig a volume fails, stop the attached
instance if possible. If a planner cannot solve, ask
human intervention
12
[*] H OFFMANN , J., and N EBEL , B. “The FF planning system: Fast plan generation
through heuristic search.” Journal of Artificial Intelligence Research, 14 (2001),
NICTA Copyright 2012 From imagination to impact
Domain model: example
• Action to delete a disk volume in PDDL
(:action Delete-Volume
:parameters (?vol - tVolume)
:precondition
(and
(volumeAvailable ?vol)
(not (unrecoverableFailure ?vol)))
:effect
(oneof
(and
(deleted ?vol)
(not (volumeAvailable ?vol)))
(unrecoverableFailure ?vol)))
13
NICTA Copyright 2012 From imagination to impact
Domain model: actions
Resource type Actions
Virtual machine launch, terminate, start, stop, change
VM size
Disk volume create, delete, create-from-snapshot,
attach, detach
Disk snapshot create, delete
Elastic IP address allocate, release, associate, disassociate
Security group create, delete
Auto-scaling group create, delete, change-sizes, change-
launch-config
Auto-scaling launch config create, delete
Tag create, delete
Others AZ, cluster online/offline, ...
14
NICTA Copyright 2012 From imagination to impact
Evaluation
• Scalability of the planner based on an internally
released prototype
– AWS cmd line tool replacement
• Use cases: ~70 different planning settings tested
• Evaluation 1: against plan length
• Evaluation 2: against # of unrelated resources
15
NICTA Copyright 2012 From imagination to impact
Evaluation 1: vs plan length
16
0
0.5
1
1.5
2
2.5
0 10 20 30 40 50 60 70
Seconds
Plan length
20 length is the maximum we need in our problem
Executing a plan with 10 steps takes ~145 sec
NICTA Copyright 2012 From imagination to impact
Evaluation 2: vs # of unrelated resources
17
0
50
100
150
200
250
300
350
400
450
500
35 350 3500
Seconds
Facts + Actions (in 1000s)
Basis: most difficult problem from previous slide
Planner‘s cost is small unless having 1000s of resouces
+ 500
unrelated
instances and
volumes
~50 related
resources
NICTA Copyright 2012 From imagination to impact
Conclusion & future work
• Rollback in cloud using AI planner
– Formalized part of AWS APIs in a planning domain
– Planning execution time is marginal for practical
system sizes
• Future work
– Extending checkpoints to capture internal resource
state
– Parallelizing plans
– Finding forward plans with constraints
18
NICTA Copyright 2012 From imagination to impact
Questions?
The paper is available at www.nicta.com.au/pub?doc=5994
19

More Related Content

What's hot

(ISM301) Engineering Netflix Global Operations In The Cloud
(ISM301) Engineering Netflix Global Operations In The Cloud(ISM301) Engineering Netflix Global Operations In The Cloud
(ISM301) Engineering Netflix Global Operations In The CloudAmazon Web Services
 
Trusted BIM: Accurate As-Builts for Project Coordination
Trusted BIM: Accurate As-Builts for Project CoordinationTrusted BIM: Accurate As-Builts for Project Coordination
Trusted BIM: Accurate As-Builts for Project CoordinationClearEdge3D Inc
 
Being Agile with Scrum - koders.co
Being Agile with Scrum - koders.coBeing Agile with Scrum - koders.co
Being Agile with Scrum - koders.coEnder Aydin Orak
 
Workflowsim escience12
Workflowsim escience12Workflowsim escience12
Workflowsim escience12Weiwei Chen
 
Lucas Gravley - HP - Self-Healing And Monitoring in a DevOps world
Lucas Gravley - HP - Self-Healing And Monitoring in a DevOps worldLucas Gravley - HP - Self-Healing And Monitoring in a DevOps world
Lucas Gravley - HP - Self-Healing And Monitoring in a DevOps worldDevOps Enterprise Summit
 
AWS Lambda: Best Practices and Common Mistakes - Chicago Cloud Conference 2020
AWS Lambda: Best Practices and Common Mistakes - Chicago Cloud Conference 2020AWS Lambda: Best Practices and Common Mistakes - Chicago Cloud Conference 2020
AWS Lambda: Best Practices and Common Mistakes - Chicago Cloud Conference 2020Derek Ashmore
 
Implementing DevOps Automation: Best Practices & Common Mistakes - DevOps Eas...
Implementing DevOps Automation: Best Practices & Common Mistakes - DevOps Eas...Implementing DevOps Automation: Best Practices & Common Mistakes - DevOps Eas...
Implementing DevOps Automation: Best Practices & Common Mistakes - DevOps Eas...Derek Ashmore
 
Capacity Planning with Free Tools
Capacity Planning with Free ToolsCapacity Planning with Free Tools
Capacity Planning with Free ToolsAdrian Cockcroft
 
Continuous Delivery with Spinnaker.io
Continuous Delivery with Spinnaker.ioContinuous Delivery with Spinnaker.io
Continuous Delivery with Spinnaker.ioMartin Roderus
 
Getting Cloudy with Remote Graphics and GPU Compute Using G2 instances (CPN21...
Getting Cloudy with Remote Graphics and GPU Compute Using G2 instances (CPN21...Getting Cloudy with Remote Graphics and GPU Compute Using G2 instances (CPN21...
Getting Cloudy with Remote Graphics and GPU Compute Using G2 instances (CPN21...Amazon Web Services
 
Anand Ahire - Electric Cloud - Visibility, Coordination, Control. Getting st...
Anand Ahire - Electric Cloud - Visibility, Coordination, Control.  Getting st...Anand Ahire - Electric Cloud - Visibility, Coordination, Control.  Getting st...
Anand Ahire - Electric Cloud - Visibility, Coordination, Control. Getting st...DevOps Enterprise Summit
 
Terraform best-practices-and-common-mistakes-dev ops-west-2021
Terraform best-practices-and-common-mistakes-dev ops-west-2021Terraform best-practices-and-common-mistakes-dev ops-west-2021
Terraform best-practices-and-common-mistakes-dev ops-west-2021Derek Ashmore
 
Shadowing production requests
Shadowing production requestsShadowing production requests
Shadowing production requestsJakauteri
 
Steve Brodie - Electric Cloud - The Yin and Yang of DevOps Transformation
Steve Brodie - Electric Cloud - The Yin and Yang of DevOps TransformationSteve Brodie - Electric Cloud - The Yin and Yang of DevOps Transformation
Steve Brodie - Electric Cloud - The Yin and Yang of DevOps TransformationDevOps Enterprise Summit
 
Sam Fell - Electric Cloud - Faster Continuous Integration with ElectricAccele...
Sam Fell - Electric Cloud - Faster Continuous Integration with ElectricAccele...Sam Fell - Electric Cloud - Faster Continuous Integration with ElectricAccele...
Sam Fell - Electric Cloud - Faster Continuous Integration with ElectricAccele...DevOps Enterprise Summit
 
(SPOT302) Availability: The New Kind of Innovator’s Dilemma
(SPOT302) Availability: The New Kind of Innovator’s Dilemma(SPOT302) Availability: The New Kind of Innovator’s Dilemma
(SPOT302) Availability: The New Kind of Innovator’s DilemmaAmazon Web Services
 
Top 10 Application Problems
Top 10 Application ProblemsTop 10 Application Problems
Top 10 Application ProblemsAppDynamics
 
Overhead Supercomputing 2011
Overhead Supercomputing 2011Overhead Supercomputing 2011
Overhead Supercomputing 2011Weiwei Chen
 
Devops at Netflix (re:Invent)
Devops at Netflix (re:Invent)Devops at Netflix (re:Invent)
Devops at Netflix (re:Invent)Jeremy Edberg
 

What's hot (20)

(ISM301) Engineering Netflix Global Operations In The Cloud
(ISM301) Engineering Netflix Global Operations In The Cloud(ISM301) Engineering Netflix Global Operations In The Cloud
(ISM301) Engineering Netflix Global Operations In The Cloud
 
Trusted BIM: Accurate As-Builts for Project Coordination
Trusted BIM: Accurate As-Builts for Project CoordinationTrusted BIM: Accurate As-Builts for Project Coordination
Trusted BIM: Accurate As-Builts for Project Coordination
 
Being Agile with Scrum - koders.co
Being Agile with Scrum - koders.coBeing Agile with Scrum - koders.co
Being Agile with Scrum - koders.co
 
Workflowsim escience12
Workflowsim escience12Workflowsim escience12
Workflowsim escience12
 
Lucas Gravley - HP - Self-Healing And Monitoring in a DevOps world
Lucas Gravley - HP - Self-Healing And Monitoring in a DevOps worldLucas Gravley - HP - Self-Healing And Monitoring in a DevOps world
Lucas Gravley - HP - Self-Healing And Monitoring in a DevOps world
 
AWS Lambda: Best Practices and Common Mistakes - Chicago Cloud Conference 2020
AWS Lambda: Best Practices and Common Mistakes - Chicago Cloud Conference 2020AWS Lambda: Best Practices and Common Mistakes - Chicago Cloud Conference 2020
AWS Lambda: Best Practices and Common Mistakes - Chicago Cloud Conference 2020
 
Implementing DevOps Automation: Best Practices & Common Mistakes - DevOps Eas...
Implementing DevOps Automation: Best Practices & Common Mistakes - DevOps Eas...Implementing DevOps Automation: Best Practices & Common Mistakes - DevOps Eas...
Implementing DevOps Automation: Best Practices & Common Mistakes - DevOps Eas...
 
Capacity Planning with Free Tools
Capacity Planning with Free ToolsCapacity Planning with Free Tools
Capacity Planning with Free Tools
 
Continuous Delivery with Spinnaker.io
Continuous Delivery with Spinnaker.ioContinuous Delivery with Spinnaker.io
Continuous Delivery with Spinnaker.io
 
Getting Cloudy with Remote Graphics and GPU Compute Using G2 instances (CPN21...
Getting Cloudy with Remote Graphics and GPU Compute Using G2 instances (CPN21...Getting Cloudy with Remote Graphics and GPU Compute Using G2 instances (CPN21...
Getting Cloudy with Remote Graphics and GPU Compute Using G2 instances (CPN21...
 
Anand Ahire - Electric Cloud - Visibility, Coordination, Control. Getting st...
Anand Ahire - Electric Cloud - Visibility, Coordination, Control.  Getting st...Anand Ahire - Electric Cloud - Visibility, Coordination, Control.  Getting st...
Anand Ahire - Electric Cloud - Visibility, Coordination, Control. Getting st...
 
Terraform best-practices-and-common-mistakes-dev ops-west-2021
Terraform best-practices-and-common-mistakes-dev ops-west-2021Terraform best-practices-and-common-mistakes-dev ops-west-2021
Terraform best-practices-and-common-mistakes-dev ops-west-2021
 
Shadowing production requests
Shadowing production requestsShadowing production requests
Shadowing production requests
 
Steve Brodie - Electric Cloud - The Yin and Yang of DevOps Transformation
Steve Brodie - Electric Cloud - The Yin and Yang of DevOps TransformationSteve Brodie - Electric Cloud - The Yin and Yang of DevOps Transformation
Steve Brodie - Electric Cloud - The Yin and Yang of DevOps Transformation
 
Sam Fell - Electric Cloud - Faster Continuous Integration with ElectricAccele...
Sam Fell - Electric Cloud - Faster Continuous Integration with ElectricAccele...Sam Fell - Electric Cloud - Faster Continuous Integration with ElectricAccele...
Sam Fell - Electric Cloud - Faster Continuous Integration with ElectricAccele...
 
(SPOT302) Availability: The New Kind of Innovator’s Dilemma
(SPOT302) Availability: The New Kind of Innovator’s Dilemma(SPOT302) Availability: The New Kind of Innovator’s Dilemma
(SPOT302) Availability: The New Kind of Innovator’s Dilemma
 
Top 10 Application Problems
Top 10 Application ProblemsTop 10 Application Problems
Top 10 Application Problems
 
Svc 202-netflix-open-source
Svc 202-netflix-open-sourceSvc 202-netflix-open-source
Svc 202-netflix-open-source
 
Overhead Supercomputing 2011
Overhead Supercomputing 2011Overhead Supercomputing 2011
Overhead Supercomputing 2011
 
Devops at Netflix (re:Invent)
Devops at Netflix (re:Invent)Devops at Netflix (re:Invent)
Devops at Netflix (re:Invent)
 

Viewers also liked

Presentazione ruben finzi
Presentazione ruben finziPresentazione ruben finzi
Presentazione ruben finziruben finzi
 
Perfection on the water
Perfection on the waterPerfection on the water
Perfection on the waterHerman Pundt
 
Eliciting Operations Requirements for Applications
Eliciting Operations Requirements for ApplicationsEliciting Operations Requirements for Applications
Eliciting Operations Requirements for ApplicationsHiroshi Wada
 
Presentazione finzi
Presentazione finziPresentazione finzi
Presentazione finziruben finzi
 

Viewers also liked (7)

Presentazione ruben finzi
Presentazione ruben finziPresentazione ruben finzi
Presentazione ruben finzi
 
Perfection on the water
Perfection on the waterPerfection on the water
Perfection on the water
 
Eliciting Operations Requirements for Applications
Eliciting Operations Requirements for ApplicationsEliciting Operations Requirements for Applications
Eliciting Operations Requirements for Applications
 
Harga iklan jitu jawa pos
Harga iklan jitu jawa posHarga iklan jitu jawa pos
Harga iklan jitu jawa pos
 
Apa itu efsd
Apa itu efsdApa itu efsd
Apa itu efsd
 
Friends
FriendsFriends
Friends
 
Presentazione finzi
Presentazione finziPresentazione finzi
Presentazione finzi
 

Similar to Automatic Undo for Cloud Management via AI Planning

POD-Diagnosis: Error Detection and Diagnosis of Sporadic Operations on Cloud ...
POD-Diagnosis: Error Detection and Diagnosis of Sporadic Operations on Cloud ...POD-Diagnosis: Error Detection and Diagnosis of Sporadic Operations on Cloud ...
POD-Diagnosis: Error Detection and Diagnosis of Sporadic Operations on Cloud ...Liming Zhu
 
Dr Elephant: LinkedIn's Self-Service System for Detecting and Treating Hadoop...
Dr Elephant: LinkedIn's Self-Service System for Detecting and Treating Hadoop...Dr Elephant: LinkedIn's Self-Service System for Detecting and Treating Hadoop...
Dr Elephant: LinkedIn's Self-Service System for Detecting and Treating Hadoop...DataWorks Summit
 
Challenges in Practicing High Frequency Releases in Cloud Environments
Challenges in Practicing High Frequency Releases in Cloud Environments Challenges in Practicing High Frequency Releases in Cloud Environments
Challenges in Practicing High Frequency Releases in Cloud Environments Liming Zhu
 
Wireless Developing Wireless Monitoring and Control devices
Wireless Developing Wireless Monitoring and Control devicesWireless Developing Wireless Monitoring and Control devices
Wireless Developing Wireless Monitoring and Control devicesAidan Venn MSc
 
SolarWinds Scalability for the Enterprise
SolarWinds Scalability for the EnterpriseSolarWinds Scalability for the Enterprise
SolarWinds Scalability for the EnterpriseSolarWinds
 
Accelerating Digital Transformation: It's About Digital Enablement
Accelerating Digital Transformation:  It's About Digital EnablementAccelerating Digital Transformation:  It's About Digital Enablement
Accelerating Digital Transformation: It's About Digital EnablementJoshua Gossett
 
Dependable Operation - Performance Management and Capacity Planning Under Con...
Dependable Operation - Performance Management and Capacity Planning Under Con...Dependable Operation - Performance Management and Capacity Planning Under Con...
Dependable Operation - Performance Management and Capacity Planning Under Con...Liming Zhu
 
Using Docker EE to Scale Operational Intelligence at Splunk
Using Docker EE to Scale Operational Intelligence at SplunkUsing Docker EE to Scale Operational Intelligence at Splunk
Using Docker EE to Scale Operational Intelligence at SplunkDocker, Inc.
 
Considerations for operating docker at scale
Considerations for operating docker at scaleConsiderations for operating docker at scale
Considerations for operating docker at scaleDocker, Inc.
 
VMworld 2013: Moving Enterprise Application Dev/Test to VMware’s Internal Pri...
VMworld 2013: Moving Enterprise Application Dev/Test to VMware’s Internal Pri...VMworld 2013: Moving Enterprise Application Dev/Test to VMware’s Internal Pri...
VMworld 2013: Moving Enterprise Application Dev/Test to VMware’s Internal Pri...VMworld
 
3-Years-of-OpenStack-Intel-IT
3-Years-of-OpenStack-Intel-IT3-Years-of-OpenStack-Intel-IT
3-Years-of-OpenStack-Intel-ITTom Fieldhouse
 
Optimize DR and Cloning with Logical Hostnames in Oracle E-Business Suite (OA...
Optimize DR and Cloning with Logical Hostnames in Oracle E-Business Suite (OA...Optimize DR and Cloning with Logical Hostnames in Oracle E-Business Suite (OA...
Optimize DR and Cloning with Logical Hostnames in Oracle E-Business Suite (OA...Andrejs Prokopjevs
 
Webinar: Build Better Software: Removing the Constraints Slowing Dev/Test Tea...
Webinar: Build Better Software: Removing the Constraints Slowing Dev/Test Tea...Webinar: Build Better Software: Removing the Constraints Slowing Dev/Test Tea...
Webinar: Build Better Software: Removing the Constraints Slowing Dev/Test Tea...Skytap Cloud
 
Preparing for Neo - Singapore OutSystems User Group October 2022 Meetup
Preparing for Neo - Singapore OutSystems User Group October 2022 MeetupPreparing for Neo - Singapore OutSystems User Group October 2022 Meetup
Preparing for Neo - Singapore OutSystems User Group October 2022 MeetupYashrajNayak4
 
EclipseCon 2016 - OCCIware : one Cloud API to rule them all
EclipseCon 2016 - OCCIware : one Cloud API to rule them allEclipseCon 2016 - OCCIware : one Cloud API to rule them all
EclipseCon 2016 - OCCIware : one Cloud API to rule them allMarc Dutoo
 
OCCIware Project at EclipseCon France 2016, by Marc Dutoo, Open Wide
OCCIware Project at EclipseCon France 2016, by Marc Dutoo, Open WideOCCIware Project at EclipseCon France 2016, by Marc Dutoo, Open Wide
OCCIware Project at EclipseCon France 2016, by Marc Dutoo, Open WideOCCIware
 
(java2days) Is the Future of Java Cloudy?
(java2days) Is the Future of Java Cloudy?(java2days) Is the Future of Java Cloudy?
(java2days) Is the Future of Java Cloudy?Steve Poole
 
OpenStack + Cloud Foundry for the OpenStack Boston Meetup
OpenStack + Cloud Foundry for the OpenStack Boston MeetupOpenStack + Cloud Foundry for the OpenStack Boston Meetup
OpenStack + Cloud Foundry for the OpenStack Boston Meetupragss
 

Similar to Automatic Undo for Cloud Management via AI Planning (20)

POD-Diagnosis: Error Detection and Diagnosis of Sporadic Operations on Cloud ...
POD-Diagnosis: Error Detection and Diagnosis of Sporadic Operations on Cloud ...POD-Diagnosis: Error Detection and Diagnosis of Sporadic Operations on Cloud ...
POD-Diagnosis: Error Detection and Diagnosis of Sporadic Operations on Cloud ...
 
Dr Elephant: LinkedIn's Self-Service System for Detecting and Treating Hadoop...
Dr Elephant: LinkedIn's Self-Service System for Detecting and Treating Hadoop...Dr Elephant: LinkedIn's Self-Service System for Detecting and Treating Hadoop...
Dr Elephant: LinkedIn's Self-Service System for Detecting and Treating Hadoop...
 
Challenges in Practicing High Frequency Releases in Cloud Environments
Challenges in Practicing High Frequency Releases in Cloud Environments Challenges in Practicing High Frequency Releases in Cloud Environments
Challenges in Practicing High Frequency Releases in Cloud Environments
 
Wireless Developing Wireless Monitoring and Control devices
Wireless Developing Wireless Monitoring and Control devicesWireless Developing Wireless Monitoring and Control devices
Wireless Developing Wireless Monitoring and Control devices
 
Planning open stack-poc
Planning open stack-pocPlanning open stack-poc
Planning open stack-poc
 
SolarWinds Scalability for the Enterprise
SolarWinds Scalability for the EnterpriseSolarWinds Scalability for the Enterprise
SolarWinds Scalability for the Enterprise
 
Accelerating Digital Transformation: It's About Digital Enablement
Accelerating Digital Transformation:  It's About Digital EnablementAccelerating Digital Transformation:  It's About Digital Enablement
Accelerating Digital Transformation: It's About Digital Enablement
 
Dependable Operation - Performance Management and Capacity Planning Under Con...
Dependable Operation - Performance Management and Capacity Planning Under Con...Dependable Operation - Performance Management and Capacity Planning Under Con...
Dependable Operation - Performance Management and Capacity Planning Under Con...
 
Using Docker EE to Scale Operational Intelligence at Splunk
Using Docker EE to Scale Operational Intelligence at SplunkUsing Docker EE to Scale Operational Intelligence at Splunk
Using Docker EE to Scale Operational Intelligence at Splunk
 
Considerations for operating docker at scale
Considerations for operating docker at scaleConsiderations for operating docker at scale
Considerations for operating docker at scale
 
VMworld 2013: Moving Enterprise Application Dev/Test to VMware’s Internal Pri...
VMworld 2013: Moving Enterprise Application Dev/Test to VMware’s Internal Pri...VMworld 2013: Moving Enterprise Application Dev/Test to VMware’s Internal Pri...
VMworld 2013: Moving Enterprise Application Dev/Test to VMware’s Internal Pri...
 
3-Years-of-OpenStack-Intel-IT
3-Years-of-OpenStack-Intel-IT3-Years-of-OpenStack-Intel-IT
3-Years-of-OpenStack-Intel-IT
 
NoSQL and ACID
NoSQL and ACIDNoSQL and ACID
NoSQL and ACID
 
Optimize DR and Cloning with Logical Hostnames in Oracle E-Business Suite (OA...
Optimize DR and Cloning with Logical Hostnames in Oracle E-Business Suite (OA...Optimize DR and Cloning with Logical Hostnames in Oracle E-Business Suite (OA...
Optimize DR and Cloning with Logical Hostnames in Oracle E-Business Suite (OA...
 
Webinar: Build Better Software: Removing the Constraints Slowing Dev/Test Tea...
Webinar: Build Better Software: Removing the Constraints Slowing Dev/Test Tea...Webinar: Build Better Software: Removing the Constraints Slowing Dev/Test Tea...
Webinar: Build Better Software: Removing the Constraints Slowing Dev/Test Tea...
 
Preparing for Neo - Singapore OutSystems User Group October 2022 Meetup
Preparing for Neo - Singapore OutSystems User Group October 2022 MeetupPreparing for Neo - Singapore OutSystems User Group October 2022 Meetup
Preparing for Neo - Singapore OutSystems User Group October 2022 Meetup
 
EclipseCon 2016 - OCCIware : one Cloud API to rule them all
EclipseCon 2016 - OCCIware : one Cloud API to rule them allEclipseCon 2016 - OCCIware : one Cloud API to rule them all
EclipseCon 2016 - OCCIware : one Cloud API to rule them all
 
OCCIware Project at EclipseCon France 2016, by Marc Dutoo, Open Wide
OCCIware Project at EclipseCon France 2016, by Marc Dutoo, Open WideOCCIware Project at EclipseCon France 2016, by Marc Dutoo, Open Wide
OCCIware Project at EclipseCon France 2016, by Marc Dutoo, Open Wide
 
(java2days) Is the Future of Java Cloudy?
(java2days) Is the Future of Java Cloudy?(java2days) Is the Future of Java Cloudy?
(java2days) Is the Future of Java Cloudy?
 
OpenStack + Cloud Foundry for the OpenStack Boston Meetup
OpenStack + Cloud Foundry for the OpenStack Boston MeetupOpenStack + Cloud Foundry for the OpenStack Boston Meetup
OpenStack + Cloud Foundry for the OpenStack Boston Meetup
 

Automatic Undo for Cloud Management via AI Planning

  • 1. NICTA Copyright 2012 From imagination to impact Automatic Undo for Cloud Management via AI Planning Ingo Weber, Hiroshi Wada, Alan Fekete, Anna Liu and Len Bass Based on presentation at USENIX HotDep ‘12
  • 2. NICTA Copyright 2012 From imagination to impact NICTA in Brief • Australia’s National Centre of Excellence in Information and Communication Technology • Five Research Labs: – ATP: Australian Technology Park, Sydney – NRL: UNSW, Sydney – CRL: ANU, Canberra – VRL: Uni. Melbourne – QRL: Uni. Queensland and QUT • 700 staff including 270 PhD students • Budget: ~$90M/yr from Fed/State Gov and industry • ~600 research papers/year, ~150 patents total
  • 3. NICTA Copyright 2012 From imagination to impact our spin-out 3
  • 4. NICTA Copyright 2012 From imagination to impact Motivation of this work • Yuruware Bolt: site-replication across regions – Executes long-running operations - create, delete and update variety resources via AWS API • Issues we face – High cost of writing unit tests • Preparing a test bed, reset after each test, and error recovery • “Why there is no DBUnit for cloud!” – High cost of error handling • Resources often get stuck in an unexpected state • Well-coordinated and tailored clean up for each case 4
  • 5. NICTA Copyright 2012 From imagination to impact Our Goal • Provide an “undo” button to cloud users – Allow for rolling back to a previous state – e.g., undelete deleted resources and reconstruct the relations among resources • We’re users - cannot alter API or implementation – Some operations are undoable, e.g., detach a VIP – Some operations are semi-undoable, e.g., stop a VM – Some operations are not undoable, e.g., delete a disk • Less invasive – Minimum changes in existing code or scripts 5
  • 6. NICTA Copyright 2012 From imagination to impact Status quo 6 dodo Operations (API calls) Administrator/script Cloud resources API calls • Administrators and/or scripts talk to cloud via API
  • 7. NICTA Copyright 2012 From imagination to impact Goal 7 Checkpoint Rollback dodo Operations (API calls) Administrator/script Commit Cloud resources API calls • Provide the ability to go back to a checkpoint • No change in scripts except checkpoint and commit
  • 8. NICTA Copyright 2012 From imagination to impact Our approach • “undo one by one in reverse order“ does not always work – No undo action is available • e.g., no “undeleteing a deleted resouce “ – Undo requires a different sequence of API calls • e.g., “deleteing an autoscaling group“ does not work – Simple undo could results in a different state • e.g., Undo “stopping an instance with a VIP“ – Undo operation may fail • e.g., detaching a volume could fail. Need an alternative. – Not optimal • e.g., Bolt‘s operations (creating many temporary resouces) • Tedious to hand-code for all possible cases! 8
  • 9. NICTA Copyright 2012 From imagination to impact Pseudo-delete for not-undoable ops 9 Checkpoint Rollback dodo Operations (API calls) Administrator/script Commit API Wrapper Logical state (e.g., pseudo- delete flags) Cloud resources Wrapped API calls API calls Apply changes Execute deletes • Execute API calls if they are undoable • Defer the execution of non-undoable calls until commit
  • 10. NICTA Copyright 2012 From imagination to impact AI planning for the rest of cases 10 Undo System Checkpoint Rollback dodo Operations (API calls) Sense cloud resources states Sense cloud resources states AI Planner Administrator/script Resource state (PDDL) Generate Input as goal state Commit API Wrapper Logical state (e.g., pseudo- delete flags) Cloud resources Wrapped API calls API calls Apply changes Execute deletes Resource state (PDDL) Input as initial state Generate Domain model (PDDL) Input Generate Compen- sation plan Compen- sation script Code generator Execute rollback • Changes made by (semi-)undoable API calls are compensated by an AI planner • AI planner finds ways to handle errors potentially occur during undo as well
  • 11. NICTA Copyright 2012 From imagination to impact AI Planning 101 • Given the initial state of the world, goal state, and a set of available actions, find a sequence of actions that leads from the initial to the goal • Highly optimized heuristics tame the problem for practical purposes 11 http://en.wikipedia.org/wiki/Tower_of_Hanoi
  • 12. NICTA Copyright 2012 From imagination to impact Planning under uncertainty • In our problem – Initial state: state at rollback – Goal state: state at checkpoint – Actions: AWS APIs • We use FF [*] with an extension to handle actions with alternative outcomes • Finds “maximal“ contingency plans – e.g., if detachnig a volume fails, stop the attached instance if possible. If a planner cannot solve, ask human intervention 12 [*] H OFFMANN , J., and N EBEL , B. “The FF planning system: Fast plan generation through heuristic search.” Journal of Artificial Intelligence Research, 14 (2001),
  • 13. NICTA Copyright 2012 From imagination to impact Domain model: example • Action to delete a disk volume in PDDL (:action Delete-Volume :parameters (?vol - tVolume) :precondition (and (volumeAvailable ?vol) (not (unrecoverableFailure ?vol))) :effect (oneof (and (deleted ?vol) (not (volumeAvailable ?vol))) (unrecoverableFailure ?vol))) 13
  • 14. NICTA Copyright 2012 From imagination to impact Domain model: actions Resource type Actions Virtual machine launch, terminate, start, stop, change VM size Disk volume create, delete, create-from-snapshot, attach, detach Disk snapshot create, delete Elastic IP address allocate, release, associate, disassociate Security group create, delete Auto-scaling group create, delete, change-sizes, change- launch-config Auto-scaling launch config create, delete Tag create, delete Others AZ, cluster online/offline, ... 14
  • 15. NICTA Copyright 2012 From imagination to impact Evaluation • Scalability of the planner based on an internally released prototype – AWS cmd line tool replacement • Use cases: ~70 different planning settings tested • Evaluation 1: against plan length • Evaluation 2: against # of unrelated resources 15
  • 16. NICTA Copyright 2012 From imagination to impact Evaluation 1: vs plan length 16 0 0.5 1 1.5 2 2.5 0 10 20 30 40 50 60 70 Seconds Plan length 20 length is the maximum we need in our problem Executing a plan with 10 steps takes ~145 sec
  • 17. NICTA Copyright 2012 From imagination to impact Evaluation 2: vs # of unrelated resources 17 0 50 100 150 200 250 300 350 400 450 500 35 350 3500 Seconds Facts + Actions (in 1000s) Basis: most difficult problem from previous slide Planner‘s cost is small unless having 1000s of resouces + 500 unrelated instances and volumes ~50 related resources
  • 18. NICTA Copyright 2012 From imagination to impact Conclusion & future work • Rollback in cloud using AI planner – Formalized part of AWS APIs in a planning domain – Planning execution time is marginal for practical system sizes • Future work – Extending checkpoints to capture internal resource state – Parallelizing plans – Finding forward plans with constraints 18
  • 19. NICTA Copyright 2012 From imagination to impact Questions? The paper is available at www.nicta.com.au/pub?doc=5994 19