Supporting Operations Personnel: A Software Engineering Perspective


A survey of the operations domain and what can go wrong with operations activities

Published in: Technology, Business

  1. NICTA Copyright 2012
     From imagination to impact
     Supporting Operations Personnel: A Software Engineering Perspective
     Len Bass
  2. About NICTA
     National ICT Australia
     • Federal and state funded research company established in 2002
     • Largest ICT research resource in Australia
     • National impact is an important success metric
     • ~700 staff/students working in 5 labs across major capital cities
     • 7 university partners
     • Providing R&D services, knowledge transfer to Australian (and global) ICT industry
     NICTA technology is in over 1 billion mobile phones
  3. Traditional View from Software Engineers
     [Diagram labels: Application, Cloud, Environment, End users, Developers]
     Traditionally, the software engineering community has viewed systems as being developed for users and existing in an environment. With this world view, the motivating questions have been: how can development costs be reduced and run-time quality improved?
  4. A Broader View
     [Diagram labels: Application, Cloud, Environment, Consumer, Operator, End users, Developers]
     Applications are affected not only by the behavior of the end users but also by actions of operators who control the environment for a consumer’s application.
  5. My Message: Consider the Operator in this Picture
     [Diagram labels: Application, Cloud, Environment, Consumer, Operator, End users, Developers]
     Computer operations is a domain that impacts every application that operates in an enterprise environment. As such, software engineers need to be aware of how actions of operators can affect their application and how actions of their application can simplify life for operators.
  6. Business Context
     “Through 2015, 80% of outages impacting mission-critical services will be caused by people and process issues, and more than 50% of those outages will be caused by change/configuration/release integration and hand-off issues.”
     Change/configuration/release integration and hand-off are all operations issues.
     Gartner: "I&O [Infrastructure and operations] represents approximately 60 percent of total IT spending worldwide."
  7. Outline
     • Overview of operations domain
       – What do operators do?
       – What can go wrong with what they do?
     • Some results NICTA has achieved or activities we have ongoing
  8. What Do Operators Do?
     [Image: Akamai’s NOC in Cambridge, Massachusetts]
     • Monitor and control data center/network/system activity
       – Install new/upgraded applications/middleware/configurations/hardware
     • Support business continuity through backups and disaster recovery
  9. Monitor and Control
     • Data Center
       – Total number and type of resources (may be virtual)
         • Processors
         • Storage
         • Network
     • Network
       – Intrusion detection
       – Routing
       – Loading
     • System
       – Allocation to resources
       – Install/uninstall
       – Configure
  10. What Can Go Wrong with Monitor and Control?
      Everything that was on the previous slide.
      • Failure
        – Installations can fail
        – Resources fail and must be replaced
      • Overload
        – Resources are over/under loaded and must be supplemented/removed
        – Networks get overloaded and routing must be changed
      • Error
        – Routing may be incorrectly specified
        – Allocation of systems to resources may be incorrect
        – Configurations can be incorrectly specified
  11. Install New/Upgraded Applications
      • Specifying configuration for applications
      • Synchronizing state for upgraded applications
      • Testing new/upgraded applications in target environment
      • Allocating resources for new version
  12. What Can Go Wrong with Installation?
      • Again, it's everything.
        – Configuration can be misspecified
        – Cut over to new version may leave inconsistent state
        – Upgrade to level N of the stack may break software in level >N of the stack
        – Testing environment may not appropriately mirror real environment
        – Configuration of one level of the stack may be inconsistent with requirements of another level
  13. Supporting Business Continuity
      • Disasters happen – natural or human causes
      • Backing up data provides recovery possibility
        – Lag between last version backed up and when disaster happens
        – In the Cloud, backing up large amounts of data to different geographic regions takes time
  14. Hand Offs
      • Problems can arise when a shift changes
        – What problems did the old shift deal with?
        – What problems were totally solved?
        – What problems were partially solved?
        – What operations activities are currently ongoing?
  15. Operations is a Target Rich Environment
      • There are many existing tools. Operation of data centers would not work without tools
      • Much room for improvement (see Gartner quote)
      • Some general approaches for improvement
        – Make software systems operations and tools process and incident aware, e.g. make them aware of an upgrade or shift change
        – Model operations processes and systems using a single model
          • Model analysis will provide opportunities for detecting trade-offs between human and automated activities
          • Model might also enable smoother error detection
  16. Outline
      • Overview of operations domain
      • Some results we have achieved or activities we have ongoing
        – Disaster Recovery product
        – Upgrade
        – Operator undo
        – Installation process
  17. Disaster Recovery
      • Clouds fail – Amazon had three outages in 2011 that affected whole availability zones or regions.
      • NICTA has a subsidiary (Yuruware) with a non-intrusive disaster recovery product (Bolt).
      • Bolt copies data periodically to a backup region.
      • Bolt utilizes sophisticated data movement techniques to reduce the time required to back up.
      • This is an insurance policy.
  18. Next Problem – Upgrade
      • Upgrades are a very common occurrence
      • Upgrade frequency of some common systems:
          Application           Average release interval
          Facebook (platform)   < 7 days
          Google Docs           < 50 days
          MediaWiki             21 days
          Joomla                30 days
      • Some systems have multiple releases per day, driven by developers – continuous deployment
  19. Various Upgrade Strategies
      • How many at once?
        – One at a time (rolling upgrade)
        – Groups at a time (staged upgrade, e.g. canaries; this uses the production environment for testing)
        – All at once (big flip)
      • How long are new versions tested to determine correctness?
        – Period based – for some period of time
        – Load based – under some utilization assumptions
      • What happens to old versions?
        – Replaced en masse
        – Maintained for some period for compatibility purposes
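The "how many at once?" choice above can be illustrated with a small sketch. This is only an illustration under assumed names: `upgrade_batches`, the strategy strings, and `group_size` are hypothetical and do not come from the slides.

```python
from typing import List


def upgrade_batches(servers: List[str], strategy: str, group_size: int = 2) -> List[List[str]]:
    """Split servers into the batches that would be upgraded together.

    The strategy names are illustrative labels for the three options on the slide.
    """
    if strategy == "rolling":        # one at a time
        size = 1
    elif strategy == "staged":       # groups at a time (e.g. canaries first)
        if group_size < 1:
            raise ValueError("group_size must be at least 1")
        size = group_size
    elif strategy == "big_flip":     # all at once
        size = max(len(servers), 1)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # Consecutive slices of `size` servers form one upgrade batch each.
    return [servers[i:i + size] for i in range(0, len(servers), size)]
```

With five servers, the rolling strategy yields five batches, a staged strategy with `group_size=2` yields three, and the big flip yields one.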
  20. Having Multiple Versions Simultaneously Active May Lead to Mixed Version Race Condition
      [Diagram labels: Client (browser), Server 1 (old version), Server 2 (new version); numbered steps 1 to 5; Initial request, HTTP reply with embedded JavaScript, Start rolling upgrade, AJAX callback, X ERROR]
  21. One Method for Preventing Mixed Version Race Condition is to Make Load Balancers Version Aware
      [Diagram labels: Client may request particular version of a service; External facing Router (with respect to the cloud); Internal Routers; Servers for Version A; Servers for Version B]
      At each level of the routing hierarchy there are two possibilities for each request:
      • Request is neutral with respect to version
      • Request specifies version
      Routing must:
      • Be fast to ensure rapid response
      • Satisfy “goodness” criteria for scheduling
      • Conform to client request with respect to version
      In addition:
      • Servers are being upgraded to a later version while servicing client requests
      • Load variation may trigger elasticity rules
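A minimal sketch of the two routing cases above (version-neutral versus version-specific requests), assuming a single router level and simple round-robin as a stand-in for whatever scheduling strategy is actually used. The class name `VersionAwareRouter` and its methods are hypothetical, not taken from the talk.

```python
import itertools
from typing import Dict, List, Optional


class VersionAwareRouter:
    """Round-robin router that honors an optional version constraint."""

    def __init__(self, servers_by_version: Dict[str, List[str]]):
        self._pools = {v: list(s) for v, s in servers_by_version.items()}
        # Version-neutral requests may go to any server.
        self._all = [s for pool in self._pools.values() for s in pool]
        self._cursors = {v: itertools.count() for v in self._pools}
        self._neutral_cursor = itertools.count()

    def route(self, requested_version: Optional[str] = None) -> str:
        if requested_version is None:
            pool, cursor = self._all, self._neutral_cursor
        elif requested_version in self._pools:
            pool = self._pools[requested_version]
            cursor = self._cursors[requested_version]
        else:
            raise LookupError(f"no pool for version {requested_version!r}")
        if not pool:
            raise LookupError("no servers available")
        # Pick the next server in the chosen pool, wrapping around.
        return pool[next(cursor) % len(pool)]
```

A request that names version B is only ever sent to a version B server, while a neutral request cycles over all servers. The sketch ignores upgrades in progress and elasticity rules, both of which the slide lists as complications.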
  22. What is the Criterion for Measuring Load Balancer Scheduling?
      • What is “goodness” with respect to routing decisions within the constraints of scheduling strategy and version awareness?
        – Uniform distribution of requests?
        – Keeping utilization within bounds?
        – Utilizing wide variety of clients?
        – Other?
      • Main result so far: version awareness is incompatible with any of the above “goodness” criteria for the staged upgrade strategy.
  23. Canary or Staged Strategy
      • Upgrade one or several servers to new version and leave them for some time.
      • Formulation:
        – Staged upgrade
          • M version A servers (constant number)
          • N version B servers (constant number)
          • Fixed number of clients
        – Version aware
          • Once a client has had a request serviced by a version B server it cannot subsequently have any requests serviced by any version A server.
  24. Bifurcation of Clients
      • Clients are bifurcated into version A clients and version B clients after some time
        – Intuitively, each client is either serviced by a server with version B, and consequently never again served by any server with version A, or is never served by a server with version B. So each client ends up in the Server A class or the Server B class but not both.
      • We call clients that end up being serviced by servers with version A class A clients. Similarly for class B clients.
      • Allowing additional clients does not fundamentally change the result.
  25. Bifurcation of Clients Implies
      • Cannot control for utilization unless new instances of version B are created in response to demand
        – There are a fixed number of clients sending requests to a fixed number of servers with version B. The number of servers cannot be varied to reflect the load generated by the fixed set of clients. Consequently, the utilization of servers with version B cannot be controlled.
      • Cannot control for uniform distribution
        – Uniform distribution means that every request has an equal chance of being sent to any server. If a client is in class A, then it has a 0% chance of being sent to a server with version B.
      • Difficult to control for wide variety of clients
        – Variations among the clients must be mirrored within class A and class B clients since the classes are fixed after the bifurcation. This is difficult to accomplish since the types of variations that are important are usually not known.
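The version-aware constraint from the formulation (a client served once by version B stays on version B) can be explored with a small simulation. This is a sketch under an assumption the slides do not state: requests from clients that have not yet touched version B are routed uniformly at random over all servers. The function name `simulate` and its parameters are hypothetical.

```python
import random
from typing import List, Set


def simulate(num_a: int, num_b: int, num_clients: int, rounds: int, seed: int = 0) -> List[int]:
    """Return the number of class B clients after each round of requests.

    Assumption: in each round every client sends one request; unpinned
    clients are routed uniformly at random over all M + N servers.
    """
    rng = random.Random(seed)
    servers = ["A"] * num_a + ["B"] * num_b
    pinned_to_b: Set[int] = set()
    history: List[int] = []
    for _ in range(rounds):
        for client in range(num_clients):
            if client in pinned_to_b:
                version = "B"  # version aware: may never return to version A
            else:
                version = rng.choice(servers)
            if version == "B":
                pinned_to_b.add(client)
        history.append(len(pinned_to_b))
    return history
```

Under this assumed routing policy the class B population can only grow, which is one way to see why utilization of the version B servers cannot be held within bounds without adding instances. How quickly a stable split is reached under other routing policies is the open question the slides raise.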
  26. Questions to Answer
      – How long does it take to reach the bifurcated state, and under what assumptions?
      – How can the goals of staged upgrade be achieved within the constraints of version awareness?
  27. Next Problem
      • Operators use scripts to perform actions such as update
      • Scripts may fail
        – May be result of API failure (more on this later)
        – May be desire to set up testing environment
        – May be result of failure of underlying virtual machine
      • When a script fails, the operator may wish to return to a known state (undo several operations)
  28. Operator Undo
      • Not always that straightforward:
        – Attaching a volume is no problem while the instance is running; detaching might be problematic
        – Creating / changing auto-scaling rules has an effect on the number of running instances
          • Cannot terminate additional instances, as the rule would create new ones!
        – Deleted / terminated / released resources are gone!
  29. Undo for System Operators
      [Diagram labels: Administrator; begin-transaction, do, do, do, rollback; + commit; + pseudo-delete]
  30. Approach
      [Diagram labels: Administrator; begin-transaction, do, do, do, rollback; Undo System; Sense cloud resources states (shown twice)]
  31. Approach
      [Diagram labels: Administrator; begin-transaction, do, do, do, rollback; Undo System; Sense cloud resources states (shown twice); Goal state; Initial state]
  32. Approach
      [Diagram labels: Administrator; begin-transaction, do, do, do, rollback; Undo System; Sense cloud resources states (shown twice); Goal state; Initial state; Plan; Set of actions; Generate code; Execute]
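The sense, plan, and execute steps in the diagrams above can be sketched as a state-diff planner. This is a toy illustration, not the undo system itself: it assumes the goal of a rollback is the state sensed at begin-transaction, models each resource as a plain dictionary of attributes, and uses hypothetical names (`plan_rollback`, `apply_actions`, the resource ids).

```python
from typing import Dict, List, Tuple

State = Dict[str, dict]           # resource id -> attributes
Action = Tuple[str, str, dict]    # (verb, resource id, attributes)


def plan_rollback(goal: State, current: State) -> List[Action]:
    """Plan the actions that turn the current state back into the goal state."""
    actions: List[Action] = []
    for rid in sorted(current.keys() - goal.keys()):
        actions.append(("delete", rid, {}))            # created during the transaction
    for rid in sorted(goal.keys() - current.keys()):
        actions.append(("create", rid, goal[rid]))     # removed during the transaction
    for rid in sorted(goal.keys() & current.keys()):
        if goal[rid] != current[rid]:
            actions.append(("update", rid, goal[rid])) # changed during the transaction
    return actions


def apply_actions(state: State, actions: List[Action]) -> State:
    """Execute a plan against an in-memory copy of the state."""
    result = {rid: dict(attrs) for rid, attrs in state.items()}
    for verb, rid, attrs in actions:
        if verb == "delete":
            result.pop(rid, None)
        else:  # "create" and "update" both set the goal attributes
            result[rid] = dict(attrs)
    return result
```

In a real cloud, a "create" action cannot bring back a resource that was truly deleted, which is presumably why the earlier slide lists pseudo-delete alongside commit.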
  33. What about API Failures?
      • Operator scripts make heavy use of checking or controlling state of resources
        – Start/stop VM
        – Is VM active?
      • These scripts become calls to the cloud provider’s API.
      • Calls may fail
        – Underlying VM has failed
        – Eventual consistency
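One common way for an operator script to cope with failing or not-yet-consistent state checks is to poll with a deadline. The sketch below is a generic illustration; `wait_for_state` and its parameters are hypothetical, and `get_state` stands in for whatever call the script makes to the provider's API.

```python
import time
from typing import Callable


def wait_for_state(get_state: Callable[[], str], expected: str,
                   timeout_s: float = 30.0, interval_s: float = 1.0,
                   sleep: Callable[[float], None] = time.sleep,
                   clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll get_state() until it returns `expected` or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        try:
            if get_state() == expected:
                return True
        except Exception:
            # Sketch only: a failed call is treated like a not-yet-consistent
            # answer. A real script would log it and catch narrower errors.
            pass
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Returning `False` on timeout gives the script an explicit signal for a stuck or unresponsive call, at which point it can decide whether to retry or roll back.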
  34. We Have Performed an Empirical Study of API Failures in EC2
      • 922 cases out of 1109 reported API-related cases in the EC2 forum from 2010 to 2012 are API failures (rather than feature requests or general inquiries).
      • We classified the extracted API failures into four types of failures:
        – content failures,
        – late timing failures,
        – halt failures, and
        – erratic failures.
  35. Results
      • A majority (60%) of the cases of API failure are related to stuck or unresponsive API calls.
      • A large portion (12%) of the cases are about slow-responding API calls.
      • 19% of the cases are related to output issues of API calls, including failed calls with unclear error messages, as well as missing output, wrong output, and unexpected output of API calls.
      • 9% of the cases reported calls that were pending for a certain time and then returned to the original state without informing the caller properly, or calls that were first reported to be successful but failed later.
  36. Next Problem – Operations Processes
      • We are looking at the process of installing new software
        – Error prone
        – Potential process improvements
  37. Motivating Scenario
      • You change the operating environment for an application
        – Configuration change
        – Version change
        – Hardware change
      • Result is degraded performance
      • When the software stack is deep with portions from different suppliers, the result is frequently:
  38. Why is Installation Error Prone?
      • Installation is complicated.
        – Installation guides for SAS 9.3 Intelligence, IBM i, Oracle 11g for Linux are ~250 pages each
        – Apache description of addresses and ports (one out of 16 descriptions) has the following elements:
          • Choosing and specifying ports for the server to listen to
          • IPv4 and IPv6
          • Protocols
          • Virtual Hosts
        – The number of configuration options that must be set can be large
          • Hadoop has 206 options
          • HBase has 64
        – Many dependencies are not visible until execution
  39. Installation Processes
      • Processes may be
        – Undocumented
        – Out of date
        – Insufficiently detailed
      • Our goal is to build a process model including error recovery mechanisms
  40. Our Activities
      • Create up-to-date process models for installation processes. Information sources are
        – Process discovery from logs
        – Process formalization from existing written descriptions
      • Process descriptions can be used to
        – Make trade-offs
        – Make recommendations in real time to operations staff
        – Recommend setting checkpoints for potential later undo, before a risky part of a process is entered
        – Assist in the detection of errors
  41. Hard Problems
      • Creating accurate process models
        – Exception handling mechanisms are not well documented
        – Labor intensive
        – Our approach
          • Top-down modeling using process modeling formalism
          • Bottom-up process mining from error logs
      • Diagnosing errors
  42. Why is Error Diagnosis Hard?
      In a distributed computing environment, when an error occurs during operations, it is difficult and time consuming to diagnose it.
      Diagnosis involves correlating messages from
      • different distributed servers
      • different portions of the software stack
      and determining the root cause of the error.
      The root cause, in turn, may be within a portion of the stack that is different from where the error is observed.
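The correlation step described above can be sketched as a time-window filter over log entries gathered from several servers and stack layers. This is a simplified illustration that assumes all entries carry comparable timestamps (no clock skew); the `LogEntry` fields and the `correlate` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class LogEntry:
    timestamp: float   # seconds since epoch, assumed comparable across servers
    source: str        # server or stack layer that wrote the entry
    level: str
    message: str


def correlate(entries: Iterable[LogEntry], error_time: float,
              window_s: float = 5.0) -> List[LogEntry]:
    """Return entries from any source in the window leading up to the error."""
    nearby = [e for e in entries
              if error_time - window_s <= e.timestamp <= error_time]
    # Order by time so that earlier events in other layers appear first.
    return sorted(nearby, key=lambda e: e.timestamp)
```

Reading the merged timeline from oldest to newest is one way to spot a message in a lower layer that precedes the error observed in a higher one.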
  43. Test Bed
      Our current test bed is the HBase stack
  44. Currently Performing Analysis of Configuration Errors
      • Cross-stack errors may take hours to diagnose
        – Log files are inconsistent
        – Error message may not give the context necessary to determine the root cause
  45. Where to Find Information about the Operations Domain?
      • Every open source program requires a variety of configuration parameters.
      • Every modern application depends on a variety of middleware, so cross-domain examples should be readily available.
      • Most organizations have extensive processes for their operations personnel. Use these processes as a framework for investigating process/product interactions.
  46. Summary
      • Operations problems will account for the majority of outages and IT costs in the next several years.
      • The operations space is a rich source of research problems that has been insufficiently mined.
      • The best way to determine what problems to attack is to monitor or interview operators.
  47. NICTA Team
      • Anna Liu
      • Alan Fekete
      • Min Fu
      • Jim Zhanwen Li
      • Qinghua Lu
      • Hiroshi Wada
      • Ingo Weber
      • Xiwei Xu
      • Liming Zhu