Automatic Undo for Cloud Management via AI Planning

presented at Berkeley AMPLab Seminar on Oct 17, 2012
by Ingo Weber, Hiroshi Wada, Alan Fekete, Anna Liu, Len Bass


Transcript

  • 1. NICTA Copyright 2012. From imagination to impact
      Automatic Undo for Cloud Management via AI Planning
      Ingo Weber, Hiroshi Wada, Alan Fekete, Anna Liu and Len Bass
      Based on presentation at USENIX HotDep ’12
  • 2. NICTA in Brief
      • Australia’s National Centre of Excellence in Information and Communication Technology
      • Five Research Labs:
          – ATP: Australian Technology Park, Sydney
          – NRL: UNSW, Sydney
          – CRL: ANU, Canberra
          – VRL: Uni. Melbourne
          – QRL: Uni. Queensland and QUT
      • 700 staff including 270 PhD students
      • Budget: ~$90M/yr from Fed/State Gov and industry
      • ~600 research papers/year, ~150 patents total
  • 3. Our spin-out
  • 4. Motivation of this work
      • Yuruware Bolt: site-replication across regions
          – Executes long-running operations: create, delete and update a variety of resources via the AWS API
      • Issues we face
          – High cost of writing unit tests
              • Preparing a test bed, resetting after each test, and error recovery
              • “Why there is no DBUnit for cloud!”
          – High cost of error handling
              • Resources often get stuck in an unexpected state
              • Each case needs a well-coordinated, tailored clean-up
  • 5. Our Goal
      • Provide an “undo” button to cloud users
          – Allow for rolling back to a previous state
          – e.g., undelete deleted resources and reconstruct the relations among resources
      • We’re users: we cannot alter the API or its implementation
          – Some operations are undoable, e.g., detach a VIP
          – Some operations are semi-undoable, e.g., stop a VM
          – Some operations are not undoable, e.g., delete a disk
      • Less invasive
          – Minimum changes in existing code or scripts
  • 6. Status quo
      [Diagram: an administrator/script performs operations (“do”, API calls) on cloud resources through API calls]
      • Administrators and/or scripts talk to the cloud via the API
  • 7. Goal
      [Diagram: the administrator/script issues checkpoint, operations (“do”, API calls), and then either rollback or commit against the cloud resources]
      • Provide the ability to go back to a checkpoint
      • No change in scripts except checkpoint and commit
  • 8. Our approach
      • “Undo one by one in reverse order” does not always work
          – No undo action is available
              • e.g., there is no “undeleting a deleted resource”
          – Undo requires a different sequence of API calls
              • e.g., “deleting an auto-scaling group” does not work
          – Simple undo could result in a different state
              • e.g., undoing “stopping an instance with a VIP”
          – An undo operation may fail
              • e.g., detaching a volume could fail, so an alternative is needed
          – Not optimal
              • e.g., Bolt’s operations (which create many temporary resources)
      • Tedious to hand-code for all possible cases!
  • 9. Pseudo-delete for not-undoable ops
      [Diagram: the administrator/script issues checkpoint, operations (“do”), rollback and commit to an API wrapper. The wrapper forwards wrapped API calls to the cloud resources as API calls, keeps a logical state (e.g., pseudo-delete flags) to which it applies changes, and executes the deferred deletes on commit]
      • Execute API calls if they are undoable
      • Defer the execution of non-undoable calls until commit
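The pseudo-delete idea on slide 9 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `FakeCloud` is a hypothetical in-memory stand-in for a cloud API, and `PseudoDeleteWrapper` and its method names are invented here, not taken from the authors' system. Deletes are recorded as flags in the wrapper's logical state and only executed on commit.

```python
class FakeCloud:
    """Hypothetical in-memory stand-in for a cloud API (illustration only)."""

    def __init__(self):
        self.volumes = set()

    def create_volume(self, vol_id):
        self.volumes.add(vol_id)

    def delete_volume(self, vol_id):
        self.volumes.remove(vol_id)


class PseudoDeleteWrapper:
    """Hypothetical API wrapper: defers non-undoable deletes until commit."""

    def __init__(self, cloud):
        self.cloud = cloud
        self.pending_deletes = set()  # logical state: pseudo-delete flags

    def create_volume(self, vol_id):
        # Assumed undoable, so it is passed straight through to the cloud.
        self.cloud.create_volume(vol_id)

    def delete_volume(self, vol_id):
        # Not undoable: only flag the volume, do not touch the cloud yet.
        if vol_id not in self.list_volumes():
            raise KeyError(vol_id)
        self.pending_deletes.add(vol_id)

    def list_volumes(self):
        # Callers see the logical view, with flagged volumes hidden.
        return sorted(self.cloud.volumes - self.pending_deletes)

    def rollback(self):
        # Dropping the flags "undeletes" the volumes at no cost.
        self.pending_deletes.clear()

    def commit(self):
        # Only now are the real, irreversible deletes executed.
        for vol_id in sorted(self.pending_deletes):
            self.cloud.delete_volume(vol_id)
        self.pending_deletes.clear()


cloud = FakeCloud()
api = PseudoDeleteWrapper(cloud)
api.create_volume("vol-1")
api.delete_volume("vol-1")   # hidden from callers, still present in the cloud
api.rollback()               # the volume is visible again
```

In this sketch, rollback only clears the flags. Compensating the calls that were passed through is left to the planner described on the next slide.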
  • 10. AI planning for the rest of cases
      [Diagram: the undo system sits between the administrator/script and the cloud resources. At checkpoint it senses the cloud resource states and generates a resource state (PDDL) that is input to the AI planner as the goal state. At rollback it senses the states again and generates a resource state (PDDL) that is input as the initial state. The domain model (PDDL) is a further input. The AI planner generates a compensation plan, a code generator turns it into a compensation script, and the script executes the rollback. The API wrapper with its logical state (e.g., pseudo-delete flags) applies changes and executes deletes on commit, as on the previous slide]
      • Changes made by (semi-)undoable API calls are compensated by an AI planner
      • The AI planner also finds ways to handle errors that may occur during undo
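The "generate resource state (PDDL)" step on slide 10 can be pictured as rendering two sensed snapshots into one planning problem: the state at rollback becomes the initial state and the state at checkpoint becomes the goal. The sketch below is an assumption-laden illustration. The function name `to_pddl_problem`, the domain name `cloud-undo`, the `tInstance` type, and the predicates `volumeAttached` and `instanceRunning` are hypothetical; only `tVolume` and `volumeAvailable` appear in the slides.

```python
def to_pddl_problem(name, domain, objects, init, goal):
    """Render sensed facts as a PDDL problem string (hypothetical helper).

    objects: dict mapping object name -> PDDL type
    init:    facts sensed at rollback time (initial state)
    goal:    facts sensed at checkpoint time (goal state)
    """
    obj_lines = "\n".join(f"    {o} - {t}" for o, t in sorted(objects.items()))
    init_lines = "\n".join(f"    ({f})" for f in sorted(init))
    goal_lines = "\n".join(f"      ({f})" for f in sorted(goal))
    return (
        f"(define (problem {name})\n"
        f"  (:domain {domain})\n"
        f"  (:objects\n{obj_lines})\n"
        f"  (:init\n{init_lines})\n"
        f"  (:goal\n    (and\n{goal_lines})))\n"
    )


# State at checkpoint: the volume was available and the instance was running.
checkpoint_state = {"volumeAvailable vol1", "instanceRunning i1"}
# State at rollback: the volume has since been attached to the instance.
rollback_state = {"volumeAttached vol1 i1", "instanceRunning i1"}

problem = to_pddl_problem(
    name="undo-1",
    domain="cloud-undo",
    objects={"vol1": "tVolume", "i1": "tInstance"},
    init=rollback_state,
    goal=checkpoint_state,
)
print(problem)
```

A real system would derive the facts from API responses rather than hand-written sets.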
  • 11. AI Planning 101
      • Given the initial state of the world, a goal state, and a set of available actions, find a sequence of actions that leads from the initial state to the goal
      • Highly optimized heuristics tame the problem for practical purposes
      [Image: Tower of Hanoi, http://en.wikipedia.org/wiki/Tower_of_Hanoi]
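The definition on slide 11 maps directly onto a state-space search. The sketch below uses plain breadth-first search over sets of facts, not the heuristic search that planners such as FF use, and the three actions are hypothetical examples (including the assumption that a volume can only be detached while its instance is stopped).

```python
from collections import deque
from typing import NamedTuple


class Action(NamedTuple):
    name: str
    pre: frozenset      # facts that must hold before the action
    add: frozenset      # facts the action makes true
    delete: frozenset   # facts the action makes false


def plan(initial, goal, actions):
    """Breadth-first forward search; returns a shortest list of action names."""
    start, goal = frozenset(initial), frozenset(goal)
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, steps = frontier.popleft()
        if goal <= state:            # every goal fact holds
            return steps
        for a in actions:
            if a.pre <= state:       # action is applicable
                nxt = (state - a.delete) | a.add
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, steps + [a.name]))
    return None                      # no plan exists


f = frozenset
actions = [
    Action("stop-i1", f({"running"}), f({"stopped"}), f({"running"})),
    Action("start-i1", f({"stopped"}), f({"running"}), f({"stopped"})),
    Action("detach-vol1", f({"attached", "stopped"}), f({"available"}), f({"attached"})),
]

# Initial state: volume attached to a running instance.
# Goal state: volume available again, instance still running.
steps = plan({"attached", "running"}, {"available", "running"}, actions)
print(steps)  # ['stop-i1', 'detach-vol1', 'start-i1']
```

Breadth-first search is chosen only because it is short and returns a shortest plan. It explores states blindly, which is the cost that the heuristics mentioned on the slide are meant to avoid.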
  • 12. Planning under uncertainty
      • In our problem
          – Initial state: state at rollback
          – Goal state: state at checkpoint
          – Actions: AWS APIs
      • We use FF [*] with an extension to handle actions with alternative outcomes
      • Finds “maximal” contingency plans
          – e.g., if detaching a volume fails, stop the attached instance if possible. If the planner cannot solve a case, ask for human intervention
      [*] Hoffmann, J., and Nebel, B. “The FF planning system: Fast plan generation through heuristic search.” Journal of Artificial Intelligence Research, 14 (2001),
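To make "actions with alternative outcomes" concrete, the sketch below builds a small contingency plan as a tree with one branch per outcome. It is a depth-limited search written for illustration, not the FF extension the slide refers to, and it rests on assumptions invented here: the first outcome of each action is treated as the intended one, failure branches with no solution fall back to `ASK_HUMAN`, and all action and fact names are hypothetical.

```python
from typing import NamedTuple

GOAL = "GOAL"
ASK_HUMAN = "ASK_HUMAN"


class Outcome(NamedTuple):
    add: frozenset
    delete: frozenset


class Action(NamedTuple):
    name: str
    pre: frozenset       # facts that must hold
    forbid: frozenset    # facts that must not hold (negative preconditions)
    outcomes: tuple      # first = intended outcome, rest = failure modes


def contingency_plan(state, goal, actions, depth=6):
    """Return GOAL, None, or (action name, [subplan per outcome])."""
    if goal <= state:
        return GOAL
    if depth == 0:
        return None
    for a in actions:
        if a.pre <= state and not (a.forbid & state):
            branches = [
                contingency_plan((state - o.delete) | o.add, goal, actions, depth - 1)
                for o in a.outcomes
            ]
            if branches[0] is None:
                continue  # intended outcome leads nowhere, try another action
            # Unsolvable failure branches are handed to a human operator.
            return (a.name, [b if b is not None else ASK_HUMAN for b in branches])
    return None


f = frozenset
detach = Action(
    "detach", f({"attached"}), f({"detachFailed"}),
    (Outcome(f({"available"}), f({"attached"})),   # success
     Outcome(f({"detachFailed"}), f())),           # failure
)
stop = Action("stop", f({"running"}), f(), (Outcome(f({"stopped"}), f({"running"})),))
detach_stopped = Action(
    "detach-stopped", f({"attached", "stopped"}), f(),
    (Outcome(f({"available"}), f({"attached"})),),
)

start_state = f({"attached", "running"})
tree = contingency_plan(start_state, f({"available"}), [detach, stop, detach_stopped])
print(tree)
```

With all three actions available, the failure branch of `detach` is covered by stopping the instance and detaching again. If only `detach` is offered, the same call returns a plan whose failure branch is `ASK_HUMAN`, which mirrors the fallback described on the slide.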
  • 13. Domain model: example
      • Action to delete a disk volume in PDDL

        (:action Delete-Volume
          :parameters (?vol - tVolume)
          :precondition
            (and
              (volumeAvailable ?vol)
              (not (unrecoverableFailure ?vol)))
          :effect
            (oneof
              (and
                (deleted ?vol)
                (not (volumeAvailable ?vol)))
              (unrecoverableFailure ?vol)))
  • 14. Domain model: actions

        Resource type               Actions
        Virtual machine             launch, terminate, start, stop, change VM size
        Disk volume                 create, delete, create-from-snapshot, attach, detach
        Disk snapshot               create, delete
        Elastic IP address          allocate, release, associate, disassociate
        Security group              create, delete
        Auto-scaling group          create, delete, change-sizes, change-launch-config
        Auto-scaling launch config  create, delete
        Tag                         create, delete
        Others                      AZ, cluster online/offline, ...
  • 15. Evaluation
      • Scalability of the planner based on an internally released prototype
          – AWS cmd line tool replacement
      • Use cases: ~70 different planning settings tested
      • Evaluation 1: against plan length
      • Evaluation 2: against # of unrelated resources
  • 16. Evaluation 1: vs plan length
      [Plot: planning time in seconds (y-axis, 0 to 2.5) against plan length (x-axis, 0 to 70)]
      • A plan length of 20 is the maximum we need in our problem
      • Executing a plan with 10 steps takes ~145 sec
  • 17. Evaluation 2: vs # of unrelated resources
      [Plot: planning time in seconds (y-axis, 0 to 500) against facts + actions in 1000s (x-axis: 35, 350, 3500). Annotations: “~50 related resources” and “+ 500 unrelated instances and volumes”]
      • Basis: most difficult problem from the previous slide
      • The planner’s cost is small unless there are 1000s of resources
  • 18. Conclusion & future work
      • Rollback in cloud using AI planner
          – Formalized part of the AWS APIs in a planning domain
          – Planning execution time is marginal for practical system sizes
      • Future work
          – Extending checkpoints to capture internal resource state
          – Parallelizing plans
          – Finding forward plans with constraints
  • 19. Questions?
      The paper is available at www.nicta.com.au/pub?doc=5994