This document discusses issues related to operations support from a software engineering perspective. It begins by providing context on NICTA and an overview of traditional and broader views of software systems. It then discusses key activities of system operators, including monitoring systems, installing new applications, and ensuring business continuity. Common problems that can occur during these operator activities are outlined. The document also provides examples of NICTA research related to disaster recovery, managing upgrades, operator undo capabilities, and modeling installation processes. Throughout, it emphasizes the importance of considering operators in the development and management of software systems.
DevOps practices emphasize small, independent teams and continuous deployment. This leads to system structures with: 1) a deep service-oriented hierarchy to enable reuse across small teams; 2) versioning and backward compatibility to allow simultaneous deployment from multiple teams; and 3) packaging of multiple services per virtual machine while avoiding race conditions during deployment.
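The backward-compatibility requirement above can be sketched in code. This is a minimal, hypothetical example (the field names, versions, and handler are illustrative, not taken from the summarized deck) of a translation layer that lets a service accept both old and new payload shapes during a rolling deployment, so teams can deploy independently:

```python
# Hypothetical sketch: a translation layer that upgrades legacy (v1)
# request payloads to the new (v2) shape, so old and new callers can
# coexist during a rolling deployment. All names are illustrative.

def translate_v1_to_v2(payload: dict) -> dict:
    """Upgrade a v1 payload to the v2 shape the handler expects."""
    return {
        "version": 2,
        "user_id": payload["uid"],            # v1 used a short field name
        "locale": payload.get("lang", "en"),  # new in v2, with a default
    }

def handle_request(payload: dict) -> str:
    # Route legacy payloads through the translation layer instead of
    # forcing every caller to upgrade at the same time.
    if payload.get("version", 1) == 1:
        payload = translate_v1_to_v2(payload)
    return f"user={payload['user_id']} locale={payload['locale']}"

print(handle_request({"uid": "42", "lang": "fr"}))  # legacy caller
print(handle_request({"version": 2, "user_id": "7", "locale": "de"}))
```

The translation layer isolates version knowledge in one place, which is what makes simultaneous deployment from multiple teams feasible.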
The document discusses the quality attribute of upgradability. It begins by providing background on NICTA and then discusses how upgrades to enterprise systems fail roughly 10% of the time despite thorough testing. It analyzes why upgrades fail through a Failure Modes and Effects Analysis that examines the potential faults at each stage of an upgrade process: making the upgrade available, preparing the environment, configuration, deployment, and activation. It also identifies research opportunities to improve the success rate of upgrades.
The document discusses a rapid deployment methodology for BMC Remedy solutions developed by generationE Technologies. It outlines the issues affecting traditional BMC Remedy deployments like developer productivity and time to market. It then describes the benefits of a rapid application development approach and generationE's BMC Remedy rapid deployment methodology which uses iterative development, prototyping, and timeboxing to compress project timelines. A case study example is also provided.
IBM Innovate: Adoption of Continuous Delivery at Scale at a Large Telco - pr... (Mirco Hering)
The document presents a maturity model for adopting continuous delivery at a large telecommunications company. It outlines various technical capabilities required for continuous delivery, arranged in a dependency tree showing the foundational capabilities required first. These capabilities include features like automated unit testing, configuration management, and deployment practices. The model is intended to help teams progressively improve processes and achieve continuous delivery over multiple years by establishing these foundation capabilities. It also provides definitions and metrics for measuring progress for each capability.
The document discusses several software development life cycle (SDLC) models, including Waterfall, Incremental, Spiral, Evolutionary Prototyping, Agile, and Rapid Application Development (RAD) models. It provides an overview of the key phases and characteristics of each model, as well as their strengths, limitations, and situations where they are best applied. The models differ in their structure, flexibility to change, emphasis on documentation or code, and ability to incorporate customer feedback throughout the development process.
Nilesh Kumar Demmita has over 14 years of experience managing IT projects involving enterprise applications, global infrastructure operations, and customer support across geographies. He has expertise in managing mainframe and legacy systems projects, complex upgrades, and large transitions with onsite and offshore models. Notable achievements include successfully implementing cost savings initiatives and transitions. He is currently the Mainframe Competency Manager at Wipro Technologies, overseeing demand, competency development, delivery assurance, and pre-sales support for global mainframe projects.
IBM's DevOps solution for CLM includes a full lifecycle suite of products for managing continuous business planning, Agile project management, continuous build, source code management, test management, and continuous application monitoring.
ITIL Best Practice for Software Companies (Daniel Brody)
A detailed outline of the Information Technology Infrastructure Library (ITIL), a set of practices for IT service management (ITSM) that focuses on aligning IT services with the needs of software companies.
Rational Team Concert is IBM's tool for team collaboration powered by the Jazz platform. The Jazz platform provides middleware services that allow tools to communicate by listening to and sending standardized events, reducing complexity compared to direct tool integration. The high-level Jazz architecture includes Rational Team Concert for source control, work items, and building; and a Jazz server with extensions that provides team services like presence and chat. Jazz aims to improve team productivity by providing awareness of teams, artifacts, responsibilities and processes to help avoid issues like broken builds.
The document provides an overview of the Rational Unified Process (RUP) software engineering methodology. It discusses key RUP concepts and best practices such as developing software iteratively, managing requirements, using component architectures, visually modeling software, continuously verifying software quality, and controlling changes to software. The RUP is presented as a mature, disciplined process that guides development activities and provides standardized artifacts and deliverables. It incorporates industry best practices and can be adapted to individual project needs while providing a common development framework.
SDLC is a framework defining the tasks performed at each step of the software or system development process. It aims to produce a high-quality system that meets or exceeds customer expectations, works effectively and efficiently within the current and planned information technology infrastructure, and is inexpensive to maintain and cost-effective to enhance.
This presentation covers the different stages of software development.
You wouldn't be surprised if I told you that we live in interesting times. New business models are created today at the same pace at which older ones are destroyed. Technology is no longer just an enabler for business; for most organizations it has become the business. In this session we touch on some of the challenges and opportunities the cloud has to offer. The cloud (IaaS, PaaS, SaaS) as we know it offers organizations immense opportunity to reduce time to market in delivering engaging customer experiences, but with all of that agility, a move to the cloud also brings numerous challenges, some obvious and some less so. We go over the challenges of engineering systems for the cloud, including a case study of engineering a complex legacy application for the cloud.
IT Ops Management in the New Virtualized, Software-Defined World (EMC)
During this recorded webcast, you will hear from EMC and Enterprise Management Associates Inc. discuss leading approaches to manage your virtualized, software-defined environment. Learn how the EMC Service Assurance Suite for Virtual Data Center and Software-Defined environments, can help you deliver new services quickly and reliably, identify and resolve problems before service impact, and improve operational efficiency.
Virtual Private Data Center Solution Overview (Angela Chavez)
With a Virtual Private Data Center (VPDC) from NewCloud, all the delivery and management of data center resources is handled for you in the cloud, while still giving you access to virtual machine management via a user-friendly portal. NewCloud's solutions are based on industry-leading technologies from VMware, including vCloud Director, vSphere, and vApp, providing you with the most flexible, reliable, and secure cloud infrastructure available today.
For more information, check us out at www.newcloudnetworks.com
The document discusses various software development life cycle (SDLC) models, including:
- The waterfall model, which uses sequential phases of requirements, design, coding, testing, and deployment. It is structured but rigid.
- Iterative development models, which allow for feedback loops and releasing partial software in iterations to get faster feedback.
- Agile methodologies like Scrum, which embrace changing requirements, focus on working software over documentation, and value customer collaboration over contracts. Key aspects are iterative development, regular refactoring, and communicating for learning.
- Pitfalls of agile include skill gaps, lack of traceability, poor communication, and not staying close enough to customers. Overall, agile aims to
This document contains Chuck Roden's resume. It summarizes his experience as a senior systems analyst and project manager with over 30 years of experience in information technology. He is seeking immediate availability for interviews and project engagements.
This lecture provides an overview of software engineering concepts including definitions of software, what software engineering is, and the major shifts in software development approaches over time. It discusses the differences between student projects and industrial strength software in terms of quality, cost, and development time. The lecture also covers problems commonly encountered in software projects such as not meeting user needs, going over budget, missing deadlines, and being unreliable or unusable. The goal of software engineering is to provide techniques to help resolve these problems and achieve high quality software delivery on time and within budget.
The Changing Role of IT: From Service Managers to Advisors (Jesse Stockall)
The document discusses the changing role of IT from service managers to advisors. It notes the rise of hybrid IT environments with public cloud, private cloud, SaaS, edge computing and more. It outlines how IT roles are expanding to include DevOps, platform operations, security operations and site reliability engineering. IT teams now operate more like service providers but with less decision making power. The document advocates for automation and governance to help scale operations while controlling costs and security risks across complex hybrid environments.
Fifteen Years of DevOps -- LISA 2012 keynote (Geoff Halprin)
There has been a lot of hullabaloo over the past few years around a concept called “DevOps.” The idea is that we need to break down the barriers between development and operations teams, and treat infrastructure as code, in order to move towards better software, more reliable and scalable systems, and continuous deployment.
For some of us who have been around a while, this is just a new label for something we’ve always done.
They say those that don’t learn from history are destined to repeat it. In this talk, we will look back at how the DevOps movement evolved, what it advocates, what it doesn’t address, and what you should take away from the movement that will help you in your professional life. We will also use this opportunity to look back over the past decade or two of system administration, and see how our challenges have changed, and how they have remained the same.
Jeff Reynolds is the Director of Enterprise Solutions Consulting at CollabNet. He has over 24 years of experience in software development. CollabNet provides an enterprise platform called TeamForge that allows organizations to securely manage development tools like Git and Subversion across distributed teams. TeamForge uses a community architecture approach with features like site organization, access controls, templates, and associating related intellectual property to address the needs of highly complex organizations.
In 2011, within the TF-NOC working group (Task Force for Network Operation Centres), a survey was carried out among network operation centres to gather information about the tools used for monitoring, statistics, and so on, mapping the functionalities against the tools NOCs use to provide them.
Since new needs and new tools have appeared, and others have evolved or disappeared, at the first meeting of Géant's Special Interest Group on Network Operation Centres (SIG-NOC) it was decided to run a new edition of the survey among its members. This presentation reviews the results of the previous edition and proposes some changes, with the aim of settling the final survey, which will run between December and January and whose results will be presented at the next SIG-NOC Meeting.
This presentation discusses the benefits of merging NPM & APM together to better assist problem response teams in troubleshooting network and application problems.
The presentation highlights a new product offering called NetPod which is a joint solution developed between Emulex and Dynatrace.
The document discusses software security remediation and provides data on how long it takes to fix common vulnerabilities. It finds that setup, testing fixes, and deployment take significant time. Cross-site scripting fixes average 9.6 minutes for stored XSS and 16.2 minutes for reflected XSS; confirming and testing fixes can take more time than the fixes themselves. The data provides a starting point for planning but is limited by coming from a single company's projects. Understanding all phases is key to minimizing remediation costs.
Building The Agile Enterprise - LSSC '12 (Gil Irizarry)
The document discusses how to scale Agile practices beyond individual teams to the enterprise level using Kanban and release management. It recommends prioritizing work, planning dependencies between teams, continuously integrating and testing code, and deploying features using a "release train" model even if not all work is complete by a deadline. Automating builds, tests and deployments is key to enabling small, frequent releases across multiple interdependent teams.
Chuck Roden has over 33 years of experience in roles such as senior systems analyst, SME, project manager, and architectural engineer. He has a proven track record of building team collaboration and meeting deliverable timelines. Roden has extensive experience supporting large corporate clients with over 18,000 servers and leading teams of over 45 people. He is seeking a new opportunity and is immediately available for an interview or new engagement.
Four Essential Steps for Removing Risk and Downtime from POWER9 Migration (Precisely)
The performance and scalability of IBM’s POWER9 servers is exciting, but the prospect of migrating to new systems isn’t. View this webinar on-demand as we explore the essential steps you need to take when planning and executing a POWER9 migration project to eliminate risk and downtime. We also share how real-time replication can be leveraged to support a painless cutover to your new machine.
During this webinar, we discuss:
• Assessing migration scope and planning your project
• Selecting a migration method
• Creating and executing a migration plan
• How Syncsort can help
We will run a scenario on three types of monolithic architectures and then focus on how the same is done with microservices, using Amazon EventBridge and the Event-Carried State Transfer pattern to create loosely coupled, independent services. This eliminates synchronous calls between services and increases system availability.
Presented at: https://www.serverless-summit.io/
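The Event-Carried State Transfer idea in the session above can be illustrated with a small in-process simulation. In production the bus would be Amazon EventBridge; here a plain Python pub/sub stand-in is used, so the class and service names (EventBus, the customer/shipping example) are illustrative assumptions, not the session's actual code:

```python
# Minimal in-process simulation of Event-Carried State Transfer.
# Events carry the full relevant state, so a consuming service keeps its
# own local copy and never makes a synchronous call back to the producer.

from collections import defaultdict

class EventBus:
    """Stand-in for a managed bus such as Amazon EventBridge."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, event):
        for handler in self._subscribers[event_type]:
            handler(event)

bus = EventBus()

# The shipping service maintains its own copy of customer addresses,
# populated from events rather than by querying the customer service.
shipping_addresses = {}

def on_customer_updated(event):
    shipping_addresses[event["customer_id"]] = event["address"]

bus.subscribe("CustomerUpdated", on_customer_updated)

# The customer service publishes the full state with the event.
bus.publish("CustomerUpdated", {"customer_id": "c1", "address": "12 Main St"})

print(shipping_addresses["c1"])  # shipping reads locally, no sync call
```

Because the consumer reads its local copy, it stays available even when the producing service is down, which is the availability gain the session describes.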
Architecting for the Cloud: Scalability and Availability (Len Bass)
This document discusses architecting for scalability and availability in the cloud. It begins with an introduction to scalability, including definitions of scalability and why systems need to scale. It describes techniques for scaling such as load balancers, rule-based autoscaling, and scaling patterns like push and pull. It also covers CPU and I/O scaling. The document then discusses availability, including definitions and metrics. It describes different availability levels, calculating system availability, and how to define availability requirements based on understanding potential fault scenarios.
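The "calculating system availability" step mentioned above follows two standard rules: components in series multiply their availabilities, while N redundant replicas fail only if every replica fails. A short worked sketch (the 99.95%/99.9% figures and topology are illustrative assumptions):

```python
# Availability arithmetic for a simple three-tier topology:
# load balancer -> 3 redundant web replicas -> database.

def series(*availabilities):
    """System is up only when every component in the chain is up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def parallel(a, replicas):
    """Redundant group is down only when all replicas are down."""
    return 1.0 - (1.0 - a) ** replicas

lb = 0.9995               # load balancer availability (assumed)
web = parallel(0.999, 3)  # three web replicas behind the balancer
db = 0.9995               # database availability (assumed)

total = series(lb, web, db)
print(f"{total:.6f}")
```

Note how redundancy makes the web tier nearly negligible in the total: the serial load balancer and database dominate, which is why availability requirements should be derived from fault scenarios rather than applied uniformly.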
Architecting for the Cloud: Cloud Providers (Len Bass)
The document discusses cloud providers and services available on Amazon Web Services. It provides an overview of compute, storage, database, and other services and how they can provide redundancy across availability zones and regions. Examples are given of different outage scenarios that can occur at the zone, region, or provider level and strategies for architecting applications to mitigate risks from these outages.
Rational Team Concert is IBM's tool for team collaboration powered by the Jazz platform. The Jazz platform provides middleware services that allow tools to communicate by listening to and sending standardized events, reducing complexity compared to direct tool integration. The high-level Jazz architecture includes Rational Team Concert for source control, work items, and building; and a Jazz server with extensions that provides team services like presence and chat. Jazz aims to improve team productivity by providing awareness of teams, artifacts, responsibilities and processes to help avoid issues like broken builds.
The document provides an overview of the Rational Unified Process (RUP) software engineering methodology. It discusses key RUP concepts and best practices such as developing software iteratively, managing requirements, using component architectures, visually modeling software, continuously verifying software quality, and controlling changes to software. The RUP is presented as a mature, disciplined process that guides development activities and provides standardized artifacts and deliverables. It incorporates industry best practices and can be adapted to individual project needs while providing a common development framework.
SDLC is a framework defining tasks performed at each step in the software or system development process. It aims to produce high quality system that meets or exceeds customer expectations, work effectively and efficiently in the current and planned information technology infrastructure, and is inexpensive to maintain and cost effective to enhance.
This presentation includes different stages of Software Deveolopment.
You wouldn't be surprised if i told you that we do live in interesting times. New business models are created at the same pace today at which older ones are being destroyed. Technology is no longer just an enabler for business, it has become the business for most organizations. In this session we will touch upon some of the challenges and opportunities that the cloud has to offer. The cloud (IaaS, PaaS, SaaS) as we know it offers organizations immense opportunity in terms of reducing time to market on delivering engaging customer experiences but with all of that agility a move to the cloud also brings numerous challenges, some obvious and some not very obvious. In this session we will go over challenges engineering systems for the cloud including a case study engineering a complex legacy application for the cloud.
IT Ops Mgmt in the New Virtualized, Software-defined WorldEMC
During this recorded webcast, you will hear from EMC and Enterprise Management Associates Inc. discuss leading approaches to manage your virtualized, software-defined environment. Learn how the EMC Service Assurance Suite for Virtual Data Center and Software-Defined environments, can help you deliver new services quickly and reliably, identify and resolve problems before service impact, and improve operational efficiency.
Virtual Private Data Center Solution OverviewAngela Chavez
With a Virtual Private Data Center (VPDC) from NewCloud, all the delivery and management of data center resources is handled for you in the cloud, while still giving you access to virtual machine management via a user-friendly portal. NewCloud’s solutions are based on industry-leading technologies from VMware including vCloud Director, vShpere and vApp, providing you with the most flexible, reliable and secure cloud infrastructure available today.
For more information, check us out at www.newcloudnetworks.com
The document discusses various software development life cycle (SDLC) models, including:
- The waterfall model, which uses sequential phases of requirements, design, coding, testing, and deployment. It is structured but rigid.
- Iterative development models, which allow for feedback loops and releasing partial software in iterations to get faster feedback.
- Agile methodologies like Scrum, which embrace changing requirements, focus on working software over documentation, and value customer collaboration over contracts. Key aspects are iterative development, regular refactoring, and communicating for learning.
- Pitfalls of agile include skill gaps, lack of traceability, poor communication, and not staying close enough to customers. Overall, agile aims to
This document contains Chuck Roden's resume. It summarizes his experience as a senior systems analyst and project manager with over 30 years of experience in information technology. He is seeking immediate availability for interviews and project engagements.
This lecture provides an overview of software engineering concepts including definitions of software, what software engineering is, and the major shifts in software development approaches over time. It discusses the differences between student projects and industrial strength software in terms of quality, cost, and development time. The lecture also covers problems commonly encountered in software projects such as not meeting user needs, going over budget, missing deadlines, and being unreliable or unusable. The goal of software engineering is to provide techniques to help resolve these problems and achieve high quality software delivery on time and within budget.
The Changing Role of IT:From Service Managers to AdvisorsJesse Stockall
The document discusses the changing role of IT from service managers to advisors. It notes the rise of hybrid IT environments with public cloud, private cloud, SaaS, edge computing and more. It outlines how IT roles are expanding to include DevOps, platform operations, security operations and site reliability engineering. IT teams now operate more like service providers but with less decision making power. The document advocates for automation and governance to help scale operations while controlling costs and security risks across complex hybrid environments.
Fifteen Years of DevOps -- LISA 2012 keynoteGeoff Halprin
There has been a lot of hullabaloo over the past few years around a concept called “DevOps.” The idea is that we need to break down the barriers between development and operations teams, and treat infrastructure as code, in order to move towards better software, more reliable and scalable systems, and continuous deployment.
For some of us who have been around a while, this is just a new label for something we’ve always done.
They say those that don’t learn from history are destined to repeat it. In this talk, we will look back at how the DevOps movement evolved, what it advocates, what it doesn’t address, and what you should take away from the movement that will help you in your professional life. We will also use this opportunity to look back over the past decade or two of system administration, and see how our challenges have changed, and how they have remained the same.
Jeff Reynolds is the Director of Enterprise Solutions Consulting at CollabNet. He has over 24 years of experience in software development. CollabNet provides an enterprise platform called TeamForge that allows organizations to securely manage development tools like Git and Subversion across distributed teams. TeamForge uses a community architecture approach with features like site organization, access controls, templates, and associating related intellectual property to address the needs of highly complex organizations.
L'any 2011, en l'entorn del grup de treball TF-NOC (Task Force for Network Operation Centres), es va fer una enquesta entre els centres d'operació de xarxa per obtenir informació sobre les eines utilitzades per a monitoratge, estadístiques, etc., fent un mapeig entre les funcionalitats i les eines que es fan servir als NOC per a aquestes funcionalitats.
Donat que han aparegut noves necessitats i noves eines, i també que altres han evolucionat o desaparegut, a la primera reunió del Grup d'Interès sobre Centres d'Operació de Xarxes (SIG-NOC, Special Interest Group - Network Operation Centers) de Géant es va decidir fer una nova edició de l'enquesta entre els seus membres. En aquesta presentació es fa una revisió dels resultats de l'anterior edició i es proposen alguns canvis per debatre quina serà l'enquesta definitiva, que es farà entre el desembre i el gener, i de la qual es presentaran els resultats al proper SIG-NOC Meeting.
This presentation discusses the benefits of merging NPM & APM together to better assist problem response teams in troubleshooting network and application problems.
The presentation highlights a new product offering called NetPod which is a joint solution developed between Emulex and Dynatrace.
The document discusses software security remediation and provides data on how long it takes to fix common vulnerabilities. It finds that setup, testing fixes, and deployment take significant time. Cross-site scripting fixes average 9.6 minutes for stored and 16.2 for reflected XSS. Confirming fixes and testing can take more time than the fixes. The data provides a starting point for planning but has limitations from being one company's projects. Understanding all phases is key to minimizng remediation costs.
Building The Agile Enterprise - LSSC '12Gil Irizarry
The document discusses how to scale Agile practices beyond individual teams to the enterprise level using Kanban and release management. It recommends prioritizing work, planning dependencies between teams, continuously integrating and testing code, and deploying features using a "release train" model even if not all work is complete by a deadline. Automating builds, tests and deployments is key to enabling small, frequent releases across multiple interdependent teams.
Chuck Roden has over 33 years of experience in roles such as senior systems analyst, SME, project manager, and architectural engineer. He has a proven track record of building team collaboration and meeting deliverable timelines. Roden has extensive experience supporting large corporate clients with over 18,000 servers and leading teams of over 45 people. He is seeking a new opportunity and is immediately available for an interview or new engagement.
Four Essential Steps for Removing Risk and Downtime from POWER9 MigrationPrecisely
The performance and scalability of IBM’s POWER9 servers is exciting, but the prospect of migrating to new systems isn’t. View this webinar on-demand as we explore the essential steps you need to take when planning and executing a POWER9 migration project to eliminate risk and downtime. We also share how real-time replication can be leveraged to support a painless cutover to your new machine.
During this webinar, we discuss:
• Assessing migration scope and planning your project
• Selecting a migration method
• Creating and executing a migration plan
• How Syncsort can help
We will run a scenario on three types of monolithic architectures and then focus on how it is done with microservices. Amazon EventBridge and the Event Carried State Transfer pattern to create loosely coupled and independent services. This eliminates synchronous calls between services to increase system availability.
Presented at: https://www.serverless-summit.io/
Architecting for the cloud scability-availabilityLen Bass
This document discusses architecting for scalability and availability in the cloud. It begins with an introduction to scalability, including definitions of scalability and why systems need to scale. It describes techniques for scaling such as load balancers, rule-based autoscaling, and scaling patterns like push and pull. It also covers CPU and I/O scaling. The document then discusses availability, including definitions and metrics. It describes different availability levels, calculating system availability, and how to define availability requirements based on understanding potential fault scenarios.
Architecting for the cloud: cloud providers (Len Bass)
The document discusses cloud providers and services available on Amazon Web Services. It provides an overview of compute, storage, database, and other services and how they can provide redundancy across availability zones and regions. Examples are given of different outage scenarios that can occur at the zone, region, or provider level and strategies for architecting applications to mitigate risks from these outages.
This paper discusses challenges in diagnosing errors when deploying Hadoop ecosystems. It provides 15 examples of specific errors that can occur with Hbase/Hadoop deployment on Amazon EC2, along with potential root causes. The paper also classifies errors as operational, configuration, software, or resource-related. It identifies inconsistencies across component logs, low signal-to-noise ratios, and uncertainty in correlating events as difficulties for error diagnosis. The paper contributes examples to a repository for mapping deployment symptoms to fault trees to determine root causes.
Architecture patterns for continuous deployment (Len Bass)
This document discusses architectural patterns to support continuous deployment. It recommends a service-oriented architecture with many loosely coupled microservices to align with small, independent teams. Services can be packaged individually or together in virtual machines, balancing independence and reuse. Backward compatibility is key and can be achieved through translation layers. Services should be version aware to safely utilize dependencies as they are upgraded. Overall structure, packaging, backward compatibility, and version awareness are important considerations for continuous deployment.
The document discusses dependability in cloud computing applications. It defines dependability as the ability of a system to deliver a service that can be trusted. In cloud environments, dependability concerns include instance and data failures, performance issues like latency during provisioning, and security threats from shared infrastructure. The document outlines techniques for handling stateful and stateless application components during failures to help achieve high dependability in cloud applications. These include checkpointing state periodically and replaying logged messages during instance recovery.
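The checkpoint-and-replay recovery technique mentioned above can be sketched with a toy stateful component. All names here are illustrative, not from any real framework, and the checkpoint and message log are assumed to survive an instance failure on durable storage.

```python
import json

class StatefulService:
    """Toy component: periodically checkpoint state, log every message
    received after the checkpoint, and replay the log on recovery."""
    def __init__(self):
        self.counter = 0       # in-memory state, lost on failure
        self.checkpoint = None # durable snapshot
        self.log = []          # durable log of messages since the snapshot

    def handle(self, msg):
        self.counter += msg
        self.log.append(msg)

    def take_checkpoint(self):
        self.checkpoint = json.dumps({"counter": self.counter})
        self.log = []          # messages before the checkpoint are covered

    def recover(self):
        # Restore the last checkpoint, then replay logged messages
        state = json.loads(self.checkpoint) if self.checkpoint else {"counter": 0}
        self.counter = state["counter"]
        for msg in self.log:
            self.counter += msg

svc = StatefulService()
svc.handle(5)
svc.take_checkpoint()
svc.handle(3)
svc.counter = 0   # simulate loss of in-memory state on instance failure
svc.recover()
print(svc.counter)  # 8
```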
A collection of exercises for building a simple deployment pipeline. It comes from a DevOps course I have taught and is targeted at instructors or individuals who want to learn the basics of a pipeline.
DevOps refers to practices aimed at reducing the time between committing a code change and deploying it to production while maintaining quality. This document discusses various DevOps practices including treating operators as first class citizens, making developers responsible for incident handling, enforcing uniform deployment processes, using continuous deployment, and developing infrastructure code like application code. Microservice architectures support continuous deployment by decentralizing coordination needs into architectural decisions.
This document discusses deployability and continuous deployment in the context of microservice architectures. It begins by describing National ICT Australia (NICTA) and its work in information and communications technology research. It then discusses how microservice architectures support continuous deployment by allowing individual teams to deploy new versions of their services independently without coordination. Key aspects of microservice architectures that enable this include: each service having a single responsibility; services communicating asynchronously via messaging; and services registering themselves with a discovery service. The document also discusses how feature toggles and canary deployments can be used to maintain consistency when deploying new versions of services.
Architecting for the cloud: elasticity and security (Len Bass)
Concurrency and state management are important considerations for achieving elasticity in cloud systems. There are three types of state: session state kept by clients, server-side state kept in processes, and persistent state stored externally. Server-side state makes scaling difficult, while stateless servers allow elasticity. Memcached provides a way to synchronize small amounts of in-memory state across servers to support stateless services running elastically in the cloud.
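The stateless-server idea in that summary can be sketched. A plain dict stands in for a memcached/Redis-style shared store (the get/set interface is an assumption); because session state lives outside the server process, any instance can serve any request, which is what makes elastic scaling safe.

```python
# Shared external store; a dict stands in for a memcached/Redis client
store = {}

def handle_request(server_id, session_id, item):
    """Stateless handler: all session state is fetched from and written
    back to the shared store, never kept in server memory."""
    cart = store.get(session_id, [])
    cart.append(item)
    store[session_id] = cart
    return server_id, cart

print(handle_request("server-A", "s1", "book"))
print(handle_request("server-B", "s1", "pen"))  # different server, same session
```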
Architecting for the cloud: intro, virtualization, IaaS (Len Bass)
- The document provides an introduction to virtual machines, virtual networks, and Infrastructure as a Service (IaaS) cloud computing.
- It demonstrates how to run a virtual machine using Oracle VirtualBox, showing that a virtual machine acts like a fully functional computer system isolated from the host machine.
- Key concepts discussed include virtual memory address translation, hypervisors managing virtual machine page tables and scheduling, and virtual machine images containing the files and settings used to boot virtual machines.
Architecting for the cloud: storage, build, test (Len Bass)
This document discusses best practices for deploying applications to the cloud, including:
- Using a deployment pipeline with continuous integration, integration testing, and staging environments to minimize errors and delays.
- Managing versions and branches to prevent errors from multiple teams working simultaneously.
- Performing integration testing after each commit to catch errors early.
- Maintaining separate databases for different environments like test vs production.
- Using feature toggles to allow uncompleted code to be checked in without breaking builds.
- Performing staging tests using production data and load to thoroughly test before deployment.
Architectural Tactics for Large Scale Systems (Len Bass)
The document discusses several challenges for large-scale systems operating in cloud environments, including failure, inconsistency, continuous deployment, and installation errors. It describes tactics used by companies like Google and Netflix to address issues like fault tolerance, eventual consistency, rolling upgrades of loosely coupled services, and error diagnosis across distributed systems. The document also outlines research at NICTA to build process models for installation, analyze configuration errors, and develop new tactics for managing changes in complex cloud applications.
The document discusses different packaging tool options for deploying code changes through a continuous integration and deployment pipeline. It describes how tools like Vagrant, Chef, Puppet, Ansible, and Docker handle various stages of the process like creating virtual machines, specifying configuration parameters, building machine images, and loading images into VMs. Containers are presented as an approach to speed up deployment by only loading updated components rather than entire virtual machine images each time.
Principles of software architecture design (Len Bass)
The document discusses principles of software architecture design. It states that quality attribute requirements have the strongest influence on architectural design. Quality attributes can be specified through concrete scenarios involving stimuli, sources, environments, artifacts, responses, and measures. Architectural tactics are techniques that improve specific quality attributes by modifying architectures. Tactics relate to quality models or expert experience and are important for designing and evaluating architectures.
Are your cloud applications performing? How Application Performance Managemen... (DevOps.com)
This document discusses application modernization and why application performance monitoring (APM) is important during the modernization process. It provides an overview of common business reasons for modernizing applications, such as increasing flexibility, availability, scalability and portability. The document then discusses common challenges of modernization and provides examples of how companies approach modernizing applications. It emphasizes the importance of APM throughout the modernization lifecycle to deliver applications with speed, quality and control. The document concludes with examples of client experiences modernizing applications and lessons learned regarding monitoring tools in containerized/cloud environments.
Presentation of the talk given by Carmine Spagnuolo (Postdoctoral Research Fellow, Università degli Studi di Salerno / ACT OR) titled "Technology insights: Decision Science Platform" at the Decision Science Forum 2019, the most important Italian event on decision science.
The document discusses scalability and provides examples of scaling a weather forecast application from 1,000 users to over 1 million users. It describes scaling the application vertically by upgrading server hardware, horizontally by adding more servers, and through distributed architectures like clustering, caching, and moving to a cloud solution. The key aspects covered are performance testing, high availability, and optimizing the application infrastructure as user load increases substantially over time.
Developers are in constant search of greenfield projects because such projects free them from constraints. But over time, I started to appreciate the innovation found in working around constraints.
The goal of this presentation is to use two experiences from my career that highlight constraints which begged for innovation. Only in retrospect on these developments can one appreciate the true novelty found in constraints.
App Modernization with .NET Core: How Travelers Insurance is Going Cloud-Native (VMware Tanzu)
At Travelers Insurance, a decade old rating engine was still on the mainframe, making it difficult to scale, expensive to run, and reliant on a shrinking pool of skilled engineers. Rating engines are fundamental to the insurance quoting process, and at the very heart of the business. It was time to modernize and take advantage of the scalability, stability, automation and fast iteration cycles made possible by cloud-native architecture.
In this webinar, Viraj Naik of Travelers Insurance and Rohit Kelapure of Pivotal will take us on their journey from mainframe to microservices. Viraj and Rohit will describe how they built a distributed, event-driven rating engine with .NET Core on Pivotal Platform using Steeltoe, running on Linux stem cells. The new rating engine exceeded SLAs and reduced time to production to under 60 minutes.
You’ll learn the keys to a successful mainframe rewrite-based modernization, including:
● A pragmatic, domain-driven approach and phased delivery, including implementing a strangler pattern and anti-corruption layers.
● How to port business objects and business rules from mainframe to .NET.
● New innovations developed and delivered during the migration process.
Speakers:
Viraj Naik, Lead Solutions Architect at Travelers Insurance
Rohit Kelapure, Principal Solution Architect at Pivotal
AITP presentation: Ed Holub, October 23, 2010 (AITPHouston)
This presentation from Gartner discusses 10 top IT infrastructure and operations trends for organizations to watch. The trends covered include virtualization, big data, energy efficiency, unified communications, staff retention, social networks, legacy migrations, compute density, cloud computing, and converged fabrics. For each trend, the presentation provides details on how the trend affects organizations and recommendations on how to prepare and respond. The overall message is that IT leaders need to be aware of these emerging trends and develop strategies to leverage and adapt to them.
The document discusses challenges with development and operations teams wanting different things and outlines how adopting DevOps practices can help. It provides an example of a utilities company that automated their application deployment, reducing time from 30 minutes with 5 people to 5 minutes. Adopting DevOps through practices like continuous integration, centralized tools, and automated deployments can improve productivity, compliance, and reduce costs.
Visualizing Your Network Health - Know your Network (DellNMS)
An old adage states that you cannot manage what you don’t know. Do you know what devices are on your network, where they are located, how they are configured, what they are connected to, and how they are affected by changes and failures?
Today’s network infrastructure is becoming more and more complex, while demands on the Network Administrator to ensure network availability and performance are higher than ever. Business critical systems depend upon you managing your entire network infrastructure and delivering high-quality service 24/7, 365 days a year. So how do you keep the pace?
Learn how real-time visibility into your entire network infrastructure provides the power to manage your assets with greater control.
This document discusses adopting a DevOps approach for 2-Speed IT. It presents value stream mapping as a way to identify bottlenecks in development and delivery pipelines. Addressing these bottlenecks through practices like continuous integration, deployment automation, and shifting security left can help organizations deliver hybrid applications across hybrid platforms and teams more quickly and with higher quality. Case studies are presented of organizations that improved delivery times, increased innovation, and gained competitive advantages by adopting DevOps.
Getting Started with ThousandEyes Proof of Concepts (ThousandEyes)
The document provides an overview and agenda for a ThousandEyes proof of concept. It discusses the ThousandEyes overview, identifying opportunities, defining success criteria, executing the proof of concept, and includes a demo. The agenda includes preparing for the proof of concept over two weeks, running the active trial for 4-6 weeks, and developing a go-forward plan over another two weeks. It also discusses best practices for executing the proof of concept and ensuring a focus on the defined success criteria.
Join SolarWinds Federal Systems Engineers, Product Management, Head Geek, and other U.S. Federal Government users of SolarWinds software (Military, Civilian, and Contractors) tomorrow for an online Federal User Group. We host this online Federal User Group in an interactive webcast format so that SolarWinds Federal customers can share feedback with the SolarWinds product team, hear the latest product updates, and learn federal features and tips.
This free online Federal User Group is an exciting opportunity for you to:
• Learn about the most recent updates to SolarWinds IT software
• Hear about the latest Federal features & tips and enterprise scalability of SolarWinds software
Speakers:
• Edward Bender, Head Federal Systems Engineer, SolarWinds
• Francois Caron, Product Management Director, SolarWinds
• Patrick Hubbard, Head Geek, SolarWinds
Agenda:
• Welcome
• SolarWinds Product Update - Francois Caron, Product Management Director, SolarWinds, Patrick Hubbard, Head Geek, SolarWinds
• Federal Features & Tips, Scaling to the Federal Enterprise - Edward Bender, Head Federal Systems Engineer, SolarWinds
• Q&A/Closing Remarks
Converting from a three tier or monolithic application to microservices can be daunting, and often comes at a non-trivial cost or effort. So why are organizations doing it, and how do they justify the expense? We will discuss some of the practices and migration strategies used by organizations who undergo this sort of transformation, such as extracting functions through refactoring and converting them to microservices. As the journey progresses, we learn that there is no one-size-fits-all approach to making applications cloud-native… so the real question needs to be ‘how do I find the right approach for me?’ We can help you begin to answer that question for yourself, by discussing the facets of consideration such as technical, procedural, and risk tolerance to name a few.
Cloud-Native Data: What data questions to ask when building cloud-native apps (VMware Tanzu)
While a number of patterns and architectural guidelines exist for cloud-native applications, a discussion about data often leads to more questions than answers. For example, what are some of the typical data problems encountered, why are they different, and how can they be overcome?
Join Prasad Radhakrishnan from Pivotal and Dave Nielsen from Redis Labs as they discuss:
- Expectations and requirements of cloud-native data
- Common faux pas and strategies on how you can avoid them
Presenters:
Prasad Radhakrishnan, Platform Architecture for Data at Pivotal
Dave Nielsen, Head of Ecosystem Programs at Redis Labs
Reactive applications: tools of the trade (huff poshinolajla)
A description of what makes Reactive an important term in developing today's enterprise applications, and a discussion of various tools that support Reactive, including node.js, Go, RxJava, Futures and Actors.
This webinar will be presented by Greg Lowe (COO/Partner) of McKonly & Asbury and Mike Yeager (President) of Cargas Systems. The webinar will provide an overview of cloud accounting applications as well as a discussion on how a cloud accounting system can lower your total cost of ownership of accounting software. During the webinar, Greg and Mike will also talk about why fast growing companies are adopting cloud accounting and how it is benefiting their organization. In addition, they will provide a brief overview demonstration of Intacct, a best in class accounting application.
Enable business continuity and high availability through active active techno... (Qian Li Jin)
IBM provides an overview of an active-active solution implemented by China Everbright Bank for their credit card system. The solution uses WebSphere MQ for real-time data synchronization between active sites in Beijing and Shanghai. This allows workload and data to be distributed across both sites for continuous availability in case of an outage. Key components discussed include the messaging architecture, application design considerations for performance, and procedures for planned and unplanned site switches. The implementation provides business continuity for Everbright Bank's credit card processing.
Operating a Highly Available Cloud Service (Depankar Neogi)
Operating a highly available cloud service is not just about technology and architecture. It has a lot to do with people and processes. Everything fails all the time, so how do you ensure you have the right people and the right processes in the right places to run a highly available web service? This talk covers the people, processes, technology, and tools required to run a highly available web service.
Reaching new levels of time-to-market and efficiency: from development to testing to production in a single step.
Gabriele Giacomelli, HP ALM Solution Consultant
Similar to Supporting Operations Personnel: A Software Engineer's Perspective
This document provides an overview of the DevOps: Engineering for Deployment and Operations course. The course focuses on DevOps technology and teaches infrastructure concepts like virtualization, networking, and security. Students will learn DevOps principles such as infrastructure as code, configuration management, microservices architecture, and deployment pipelines. The course involves weekly readings, videos, quizzes, discussions, and assignments using DevOps tools like Vagrant, Docker, Jenkins, Ansible, and Kubernetes. The final grade is based on daily quizzes, a final exam, assignments, and class participation.
This document outlines the syllabus for a DevOps engineering course covering deployment and operations. The course will introduce students to DevOps concepts and software engineering practices through lectures, assignments, and discussions. Students will complete assignments involving automation tools and reflect on their experiences. Assignments are graded based on automation, documentation, and reflections. The course aims to provide both theoretical knowledge through exams and practical skills through hands-on assignments.
The document discusses securing software supply chains and operations. It covers identifying critical data and resources to protect, such as credentials, sensitive data, and hardware. Techniques for protection include encryption, access controls, and input validation. OAuth and vaults are described for managing credentials for services, while RBAC and LDAP help control individual access. Regularly patching vulnerabilities is important, but the process can be complex. The OWASP Top 10 list catalogs common web application security risks and mitigations.
This document discusses disaster recovery and deployment operations for software engineers. It defines key terms like business continuity, disaster recovery, recovery time objective (RTO), and recovery point objective (RPO). Systems are divided into tiers based on their RTO and RPO. Tier 1 systems require data to be kept up to date in a mirrored secondary data center, while tiers 2-4 can use backups stored either online or offline. Software in secondary data centers must also be kept in alignment. The failover process involves triggering a switch, activating the secondary data center, and resuming operations from that site.
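The tiering by RTO and RPO described in that summary can be sketched. The thresholds below are illustrative only; real cutoffs are organization-specific.

```python
def classify_tier(rto_hours, rpo_hours):
    """Assign a disaster-recovery tier from recovery time objective (RTO)
    and recovery point objective (RPO). Thresholds are hypothetical."""
    if rto_hours <= 1 and rpo_hours <= 0.25:
        return 1   # mirrored secondary data center, near-zero data loss
    if rto_hours <= 8:
        return 2   # online backups, fast restore
    if rto_hours <= 24:
        return 3   # online or near-line backups
    return 4       # offline backups acceptable

print(classify_tier(0.5, 0.1))  # 1
print(classify_tier(48, 24))    # 4
```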
The document discusses three key aspects of deployment and operations for software engineers: telemetry, incident response, and live testing. Telemetry involves collecting various metrics and logs to monitor systems. When incidents occur, the response aims to restore service, analyze the cause, and prevent future occurrences. Companies differ in whether developers or dedicated teams handle incidents. Live testing after deployment identifies weaknesses by intentionally introducing failures or performing maintenance tasks.
1) The build environment creates an executable image of the system by compiling and linking all code and dependencies. It then performs functional tests on the integrated system using a test harness.
2) If the tests pass, the build is promoted to the staging environment for further testing. Otherwise, failures will stop the promotion process or just be noted.
3) The build environment aims to detect defects in module interfaces and interactions by testing the fully integrated system, using real data but in limited quantities to keep tests fast. External services may be directly used if read-only or have test versions if read-write.
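The build-test-promote flow above amounts to a gate: run the functional tests against the integrated build, promote to staging only if all of them pass. A minimal sketch with hypothetical function names:

```python
def build_and_promote(compile_fn, tests, promote_fn):
    """Run every test against the built image; promote only on all-green,
    otherwise report which tests blocked the promotion."""
    image = compile_fn()
    failures = [t.__name__ for t in tests if not t(image)]
    if failures:
        return ("blocked", failures)
    promote_fn(image)
    return ("promoted", [])

staging = []
result = build_and_promote(
    lambda: "build-42",          # stand-in for compile-and-link
    [lambda img: True],          # stand-in for the functional test suite
    staging.append,              # stand-in for promotion to staging
)
print(result, staging)  # ('promoted', []) ['build-42']
```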
Version control systems like Git allow software engineers to maintain textual information in a shared central or distributed repository with versioning capabilities. Configuration management tools like Chef and Puppet maintain consistency across machines by specifying configuration actions through scripts. Configuration parameters provide flexibility by allowing values to vary between environments, but credentials should be treated separately due to security risks of exposure.
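The split between per-environment configuration parameters and credentials can be sketched as follows; the config keys and the DB_PASSWORD variable are illustrative. Parameters live in versioned data, while credentials come from the process environment (or a vault) so they never sit in the repository.

```python
import os

# Per-environment parameters: safe to keep under version control
CONFIG = {
    "test":       {"db_host": "db.test.internal", "pool_size": 2},
    "production": {"db_host": "db.prod.internal", "pool_size": 20},
}

def load_config(env):
    cfg = dict(CONFIG[env])
    # Credentials are injected at run time, never stored with the config
    cfg["db_password"] = os.environ.get("DB_PASSWORD", "")
    return cfg

print(load_config("test")["db_host"])  # db.test.internal
```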
Microservice architecture consists of independently deployable services that communicate through defined interfaces. It supports DevOps processes by reducing coordination needs and allowing independent technology choices. Microservices improve modifiability over performance and can achieve reuse through deep service hierarchies. Discovery services locate service instances by name, and communication occurs via RPC, REST, or protocol buffers with JSON or XML payloads. Microservices are packaged with dependencies as containers deployed in pods for scalability. Design considers forward/backward compatibility across versions.
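The name-based discovery described there can be sketched as a minimal in-memory registry; service names and addresses below are made up, and real discovery services add health checks and lease expiry on top of this.

```python
import random

class DiscoveryService:
    """Minimal registry: instances register under a service name;
    clients look the name up and receive one instance's address."""
    def __init__(self):
        self.registry = {}

    def register(self, name, address):
        self.registry.setdefault(name, []).append(address)

    def lookup(self, name):
        instances = self.registry.get(name, [])
        if not instances:
            raise LookupError(f"no instances of {name}")
        return random.choice(instances)  # trivial load spreading

d = DiscoveryService()
d.register("inventory", "10.0.0.5:8080")
d.register("inventory", "10.0.0.6:8080")
print(d.lookup("inventory"))  # one of the two registered addresses
```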
This document discusses various aspects of infrastructure security. It begins with an overview of cryptography concepts like symmetric encryption, asymmetric encryption, hashing, and public key infrastructure (PKI). It then explains how protocols like TLS, SSH, and secure file transfer use these concepts to provide security. The document also summarizes intrusion detection systems and how they can monitor hosts or network traffic for anomalous activity.
Container images are stored in repositories similar to version control systems. Container orchestrators like Kubernetes allocate images to hosts in pods, which group related containers, and can automatically scale pods. Serverless architectures use containers that load quickly for individual requests from pools maintained by cloud providers, restricting usage but enabling fast load times.
The document discusses distributed systems and failures in cloud computing environments. It describes how cloud providers organize their data centers across regions and availability zones. When failures occur, they can impact the entire cloud or just parts of it. The long tail is discussed as a phenomenon where some operations take much longer than average to complete. Techniques for dealing with failures include retrying operations, instantiating redundant services, and implementing circuit breakers. Load balancers help distribute traffic across multiple servers. Autoscaling allows adding more servers when load increases. Achieving atomic operations is challenging in distributed systems due to latency and potential component failures. Consensus algorithms like Paxos can help synchronize data across servers in a consistent manner.
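The circuit-breaker tactic mentioned above can be sketched as follows. This is a deliberately simplified breaker: it opens after a run of consecutive failures and then rejects calls immediately, omitting the half-open recovery timeout a production breaker would have.

```python
class CircuitBreaker:
    """After `threshold` consecutive failures the breaker opens and
    fails fast instead of waiting on a known-bad downstream service."""
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    def call(self, fn):
        if self.failures >= self.threshold:
            raise RuntimeError("circuit open")
        try:
            result = fn()
            self.failures = 0   # any success closes the breaker again
            return result
        except Exception:
            self.failures += 1
            raise

breaker = CircuitBreaker(threshold=2)

def flaky():
    raise TimeoutError("slow backend")

for _ in range(3):
    try:
        breaker.call(flaky)
    except TimeoutError:
        pass                    # a retried call reached the backend and failed
    except RuntimeError as e:
        print(e)                # third attempt is rejected: circuit open
```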
A container is a lightweight virtualization technology that isolates applications from other applications. Container images are structured in layers that allow for faster deployment of updates, as only changed layers need to be transferred. Loading a container across a network takes milliseconds compared to minutes for virtual machines, but containers have limitations in networking and file system capabilities compared to virtual machines.
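The layer-reuse idea behind that speedup can be sketched: only layers the target host does not already hold are shipped over the network, while unchanged base layers come from the local cache. The layer identifiers below are illustrative digests, not real image content.

```python
def layers_to_transfer(local_layers, image_layers):
    """Return the image layers the host still needs; everything the host
    already caches is skipped, which is why small updates deploy fast."""
    have = set(local_layers)
    return [layer for layer in image_layers if layer not in have]

cached = ["os:sha1", "runtime:sha2"]
new_image = ["os:sha1", "runtime:sha2", "app:sha9"]
print(layers_to_transfer(cached, new_image))  # ['app:sha9']
```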
IPv4 and IPv6 are protocols that assign numeric addresses to devices to enable communication and routing of information across the internet. DNS is a hierarchical system that maps human-friendly domain names to these numeric IP addresses. It uses caching and time-to-live values to improve efficiency. Local networks can be merged using bridges, which expose internal addresses, or NAT, which hides internal addresses behind a single external address.
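The TTL-based caching described there can be sketched with a toy resolver cache; the resolver function and the returned address are stand-ins. An answer is reused until its TTL expires, so repeated lookups avoid upstream queries.

```python
class DnsCache:
    """Toy resolver cache: answers are reused until their TTL expires,
    then the upstream resolver is queried again."""
    def __init__(self, resolve_fn):
        self.resolve_fn = resolve_fn
        self.cache = {}            # name -> (address, expiry_time)

    def lookup(self, name, ttl, now):
        entry = self.cache.get(name)
        if entry and entry[1] > now:
            return entry[0]        # served from cache, no upstream query
        address = self.resolve_fn(name)
        self.cache[name] = (address, now + ttl)
        return address

upstream_queries = []
def fake_resolver(name):           # stand-in for a real upstream resolver
    upstream_queries.append(name)
    return "93.184.216.34"

dns = DnsCache(fake_resolver)
dns.lookup("example.com", ttl=300, now=0)
dns.lookup("example.com", ttl=300, now=100)   # within TTL: cached
dns.lookup("example.com", ttl=300, now=400)   # expired: re-resolved
print(len(upstream_queries))  # 2
```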
This document discusses quantum computing. It explains that qubits, the basic unit of quantum computers, can exist in superposition and be entangled. Common qubit operations like CNOT gates are used to entangle qubits and generate Bell pairs. Quantum algorithms like Grover's, Shor's, and HHL are described, though current quantum computers have only demonstrated small problems compared to what the algorithms enable. Quantum computers are not expected to replace classical computers but will be useful for problems involving combinatorics.
This document discusses blockchain technology and designing systems with blockchain. It begins with an overview of the hype around blockchain and how interest has grown over time. It then covers the key elements of a blockchain, including the contract, immutable transaction history achieved through cryptography and consensus, and examples of how blockchain could be applied in areas like payments, identity management, and asset registry. The document dives deeper into specific blockchains like Bitcoin and Ethereum and the concepts of smart contracts. It also outlines CSIRO's research focus areas regarding blockchain.
This document discusses Len Bass's experience teaching a DevOps course. It begins with an overview of DevOps and what it aims to accomplish. It then describes the structure and content of the course, which combines lectures, readings, discussions, and hands-on assignments using open source DevOps tools. The document notes that students often lack fundamental knowledge in areas like networking, security, and operations. It proposes establishing an "infrastructure minor" to address gaps in students' undergraduate education.
Blockchains are distributed ledger technologies that are not well understood outside of cryptography experts. They consist of three main elements - a contract that specifies how users can interact, an immutable transaction history, and cryptography. Zerocash is a cryptocurrency that uses zero-knowledge proofs to allow transactions to occur while keeping private all details of the transaction such as sender, recipient, and amount. In conclusion, blockchains have significant potential but also hype, with security and privacy properties that can be configured through their technical design.
A blockchain has three key elements: 1) A contract that specifies how individuals can interact with the blockchain and their obligations, 2) An immutable history of all valid transactions within the contract, and 3) Cryptographic encoding of the contract and proofs of compliance that can be verified but keep details private.
DevOps and Safety Critical Systems discusses applying DevOps practices like continuous deployment to safety critical systems. It proposes "partial continuous deployment" which involves:
1. Identifying and isolating safety critical portions of a system's architecture.
2. Applying continuous deployment practices to non-safety critical portions.
3. Continuing traditional testing methods for safety critical portions.
It discusses past efforts in smart grid security controls and hardening deployment pipelines that provide foundations for this approach. Key steps include explicitly defining safety requirements, analyzing architectures to identify minimum required safe components, and refactoring to separate safe and non-safe concerns. Regulatory approval is viewed as a major gate to implementing partial continuous deployment for real safety-critical systems.
The document discusses securing the "last mile" of the software supply chain, which refers to getting software deployed from development to production. It presents a process for hardening the deployment pipeline that involves identifying security requirements, determining which components are trusted vs untrusted, analyzing for vulnerabilities, and refining the model by adding new trusted components until no vulnerabilities remain. Specifically, it applies this process to example deployment pipeline that uses Jenkins and Docker, finding vulnerabilities and addressing them by adding steps like encrypting files and verifying image checksums using small, independently verifiable components.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/building-and-scaling-ai-applications-with-the-nx-ai-manager-a-presentation-from-network-optix/
Robin van Emden, Senior Director of Data Science at Network Optix, presents the “Building and Scaling AI Applications with the Nx AI Manager,” tutorial at the May 2024 Embedded Vision Summit.
In this presentation, van Emden covers the basics of scaling edge AI solutions using the Nx tool kit. He emphasizes the process of developing AI models and deploying them globally. He also showcases the conversion of AI models and the creation of effective edge AI pipelines, with a focus on pre-processing, model conversion, selecting the appropriate inference engine for the target hardware and post-processing.
van Emden shows how Nx can simplify the developer’s life and facilitate a rapid transition from concept to production-ready applications. He provides valuable insights into developing scalable and efficient edge AI solutions, with a strong focus on practical implementation.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor... (Neo4j)
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
TrustArc Webinar - 2024 Global Privacy Survey (TrustArc)
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
Full-RAG: A modern architecture for hyper-personalizationZilliz
Mike Del Balso, CEO & Co-Founder at Tecton, presents "Full RAG," a novel approach to AI recommendation systems, aiming to push beyond the limitations of traditional models through a deep integration of contextual insights and real-time data, leveraging the Retrieval-Augmented Generation architecture. This talk will outline Full RAG's potential to significantly enhance personalization, address engineering challenges such as data management and model training, and introduce data enrichment with reranking as a key solution. Attendees will gain crucial insights into the importance of hyperpersonalization in AI, the capabilities of Full RAG for advanced personalization, and strategies for managing complex data integrations for deploying cutting-edge AI solutions.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Building RAG with self-deployed Milvus vector database and Snowpark Container...Zilliz
This talk will give hands-on advice on building RAG applications with an open-source Milvus database deployed as a docker container. We will also introduce the integration of Milvus with Snowpark Container Services.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
UiPath Test Automation using UiPath Test Suite series, part 6DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
UiPath Test Automation with generative AI and Open AI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with Open AI advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices desire to take full advantage of the features
available on those devices, but many of the features provide convenience and capability but sacrifice security. This best practices guide outlines steps the users can take to better protect personal devices and information.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Supporting Operations Personnel: A Software Engineering Perspective
1. NICTA Copyright 2012 From imagination to impact
Supporting Operations Personnel: A Software Engineering Perspective
Len Bass
2. About NICTA
National ICT Australia
• Federal and state funded research company established in 2002
• Largest ICT research resource in Australia
• National impact is an important success metric
• ~700 staff/students working in 5 labs across major capital cities
• 7 university partners
• Providing R&D services and knowledge transfer to Australian (and global) ICT industry
NICTA technology is in over 1 billion mobile phones
3. Traditional View from Software Engineers
[Diagram: an Application running in a Cloud Environment, with End users and Developers]
Traditionally, the software engineering community has viewed systems as being developed for users and existing in an environment. With this world view, the motivating questions have been: how can development costs be reduced and run-time quality improved?
4. A Broader View
[Diagram: Application, Cloud Environment, Consumer, Operator, End users, Developers]
Applications are not only affected by the behavior of the end users but also by actions of operators who control the environment for a consumer's application.
5. My Message: Consider the Operator in this Picture
[Diagram: Application, Cloud Environment, Consumer, Operator, End users, Developers]
Computer operations is a domain that impacts every application that operates in an enterprise environment. As such, software engineers need to be aware of how the actions of operators can affect their application and how the actions of their application can simplify life for operators.
6. Business Context
"Through 2015, 80% of outages impacting mission-critical services will be caused by people and process issues, and more than 50% of those outages will be caused by change/configuration/release integration and hand-off issues."
Change/configuration/release integration and hand-off are all operations issues.
Gartner - http://www.rbiassets.com/getfile.ashx/42112626510
"I&O [Infrastructure and operations] represents approximately 60 percent of total IT spending worldwide."
http://www.gartner.com/it/page.jsp?id=1807615
7. Outline
• Overview of operations domain
  – What do operators do?
  – What can go wrong with what they do?
• Some results NICTA has achieved or activities we have ongoing
8. What Do Operators Do?
Akamai's NOC in Cambridge, Massachusetts
• Monitor and control data center/network/system activity
  – Install new/upgraded applications/middleware/configurations/hardware
• Support business continuity through backups and disaster recovery
9. Monitor and Control
• Data Center
  – Total number and type of resources (may be virtual)
    • Processors
    • Storage
    • Network
• Network
  – Intrusion detection
  – Routing
  – Loading
• System
  – Allocation to resources
  – Install/uninstall
  – Configure
10. What Can Go Wrong with Monitor and Control?
Everything that was on the previous slide.
• Failure
  – Installations can fail
  – Resources fail and must be replaced
• Overload
  – Resources are over- or under-loaded and must be supplemented or removed
  – Networks get overloaded and routing must be changed
• Error
  – Routing may be incorrectly specified
  – Allocation of systems to resources may be incorrect
  – Configurations can be incorrectly specified
11. Install New/Upgraded Applications
• Specifying configuration for applications
• Synchronizing state for upgraded applications
• Testing new/upgraded applications in the target environment
• Allocating resources for the new version
12. What Can Go Wrong with Installation?
• Again, it's everything.
  – Configuration can be misspecified
  – Cut-over to the new version may leave inconsistent state
  – Upgrade to level N of the stack may break software in levels >N of the stack
  – Testing environment may not appropriately mirror the real environment
  – Configuration of one level of the stack may be inconsistent with requirements of another level
13. Supporting Business Continuity
• Disasters happen – natural or human causes
• Backing up data provides a recovery possibility
  – Lag between the last version backed up and when disaster happens
  – In the cloud, backing up large amounts of data to different geographic regions takes time
14. Hand Offs
• Problems can arise when a shift changes
  – What problems did the old shift deal with?
  – What problems were totally solved?
  – What problems were partially solved?
  – What operations activities are currently ongoing?
15. Operations is a Target Rich Environment
• There are many existing tools; operation of data centers would not work without tools
• Much room for improvement (see Gartner quote)
• Some general approaches for improvement
  – Make software systems, operations, and tools process- and incident-aware, e.g. make them aware of an upgrade or shift change
  – Model operations processes and systems using a single model
    • Model analysis will provide opportunities for detecting trade-offs between human and automated activities
    • Model might also enable smoother error detection
16. Outline
• Overview of operations domain
• Some results we have achieved or activities we have ongoing
  – Disaster recovery product
  – Upgrade
  – Operator undo
  – Installation process
17. Disaster Recovery
• Clouds fail – Amazon had three outages in 2011 that affected whole availability zones or regions
• NICTA has a subsidiary (Yuruware) with a non-intrusive disaster recovery product (Bolt)
• Bolt copies data periodically to a backup region
• Bolt utilizes sophisticated data movement techniques to reduce the time required to back up
• This is an insurance policy
18. Next Problem – Upgrade
• Upgrades are a very common occurrence
• Upgrade frequency of some common systems:

  Application           Average release interval
  Facebook (platform)   < 7 days
  Google Docs           < 50 days
  MediaWiki             21 days
  Joomla                30 days

• Some systems have multiple releases per day, driven by developers – continuous deployment
19. Various Upgrade Strategies
• How many at once?
  – One at a time (rolling upgrade)
  – Groups at a time (staged upgrade, e.g. canaries; this is using the production environment for testing)
  – All at once (big flip)
• How long are new versions tested to determine correctness?
  – Period based – for some period of time
  – Load based – under some utilization assumptions
• What happens to old versions?
  – Replaced en masse
  – Maintained for some period for compatibility purposes
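The "how many at once?" choices above can be sketched as small driver functions. This is a minimal illustration only; the server names and the `upgrade_server` step are hypothetical stand-ins for real provisioning actions.

```python
# Sketch of the three "how many at once?" upgrade strategies.
# All names are hypothetical; upgrade_server stands in for the real
# provisioning step. Note the end state is the same for all three --
# the strategies differ in pacing and in how long old and new
# versions coexist.

def upgrade_server(name):
    """Pretend to upgrade one server; return its new version tag."""
    return (name, "v2")

def rolling_upgrade(servers):
    """One at a time: each server is taken out, upgraded, put back."""
    return [upgrade_server(s) for s in servers]

def staged_upgrade(servers, stage_size):
    """Groups at a time: upgrade only the first stage_size 'canaries';
    the rest keep running the old version for now."""
    canaries = [upgrade_server(s) for s in servers[:stage_size]]
    rest = [(s, "v1") for s in servers[stage_size:]]
    return canaries + rest

def big_flip(servers):
    """All at once: every server moves to the new version together."""
    return [upgrade_server(s) for s in servers]
```

The interesting case is `staged_upgrade`, which deliberately leaves a mixed-version fleet in production for a testing period.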
20. Having Multiple Versions Simultaneously Active May Lead to a Mixed-Version Race Condition
[Diagram: the client (browser) sends an initial request to Server 1 (old version), which returns an HTTP reply with embedded JavaScript; a rolling upgrade starts; the subsequent AJAX callback reaches Server 2 (new version) and fails with an ERROR]
21. One Method for Preventing the Mixed-Version Race Condition is to Make Load Balancers Version Aware
[Diagram: an external-facing router (wrt the cloud) feeds internal routers, each fronting servers for Version A and servers for Version B; a client may request a particular version of a service]
At each level of the routing hierarchy there are two possibilities for each request:
• Request is neutral with respect to version
• Request specifies a version
Routing must:
• Be fast to ensure rapid response
• Satisfy "goodness" criteria for scheduling
• Conform to the client request wrt version
In addition:
• Servers are being upgraded to a later version while servicing client requests
• Load variation may trigger elasticity rules
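The version-aware routing rule can be sketched in a few lines: a version-pinned request must go to a server of that version, while a version-neutral request may go anywhere. This is a toy sketch; the pool names and round-robin scheduling are assumptions, not the slide's design.

```python
# Minimal sketch of a version-aware router (all names hypothetical).
# Round-robin is used as a placeholder "goodness" criterion.
import itertools

class VersionAwareRouter:
    def __init__(self, pools):
        # pools maps version tag -> list of server names,
        # e.g. {"A": ["a1", "a2"], "B": ["b1"]}
        self.pools = pools
        # one round-robin cycle per version pool...
        self._per_version = {v: itertools.cycle(s) for v, s in pools.items()}
        # ...and one over all servers for version-neutral requests
        self._any = itertools.cycle([s for ss in pools.values() for s in ss])

    def route(self, version=None):
        """Pick a server, honouring a version constraint if present."""
        if version is None:
            return next(self._any)           # version-neutral request
        return next(self._per_version[version])  # version-pinned request
```

Even this toy shows the tension the slide describes: once requests are pinned, the scheduler can no longer distribute them freely.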
22. What is the Criterion for Measuring Load Balancer Scheduling?
• What is "goodness" with respect to routing decisions within the constraints of scheduling strategy and version awareness?
  – Uniform distribution of requests?
  – Keeping utilization within bounds?
  – Utilizing a wide variety of clients?
  – Other?
• Main result so far: version awareness is incompatible with any of the above "goodness" criteria for the staged upgrade strategy.
23. Canary or Staged Strategy
• Upgrade one or several servers to the new version and leave them for some time.
• Formulation:
  – Staged upgrade
    • M version A servers (constant number)
    • N version B servers (constant number)
    • Fixed number of clients
  – Version aware
    • Once a client has had a request serviced by a version B server, it cannot subsequently have any requests serviced by any version A server.
24. Bifurcation of Clients
• Clients are bifurcated into version A clients and version B clients after some time
  – Intuitively, each client either is serviced by a server with version B and consequently never served by any server with version A, or is never served by a server with version B. So each client ends up in the server A class or the server B class, but not both.
• We call clients that end up being serviced by servers with version A class A clients; similarly for class B clients.
• Allowing additional clients does not fundamentally change the result.
25. Bifurcation of Clients Implies
• Cannot control for utilization unless new instances of version B are created in response to demand
  – There are a fixed number of clients sending requests to a fixed number of servers with version B. We cannot vary the number of servers to reflect the load generated by the fixed set of clients, and consequently cannot control the utilization of servers with version B.
• Cannot control for uniform distribution
  – Uniform distribution means that every request has an equal chance of being sent to any server. If a client is in class A, then it has 0% chance of being sent to a server with version B.
• Difficult to control for a wide variety of clients
  – Variations among the clients must be mirrored within class A and class B clients, since the classes are fixed after the bifurcation. This is difficult to accomplish since the types of variations that are important are usually not known.
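The bifurcation argument can be observed in a toy simulation. Under random routing, the one stated constraint (a client served by version B is pinned to version B) is enough to split the client population into two disjoint classes. The parameters and the routing model here are illustrative assumptions.

```python
# Toy simulation of client bifurcation under a version-aware staged
# upgrade. Setup (hypothetical): a fixed set of clients, a fixed
# number of version-A and version-B servers, and uniformly random
# routing -- except that once a client is served by a version-B
# server it is pinned to version B forever.
import random

def simulate(n_clients, a_servers, b_servers, rounds, seed=0):
    rng = random.Random(seed)
    pinned_to_b = set()
    for _ in range(rounds):
        for client in range(n_clients):
            if client in pinned_to_b:
                continue                      # must stay on version B
            server = rng.randrange(a_servers + b_servers)
            if server >= a_servers:           # landed on a B server
                pinned_to_b.add(client)       # pinned from now on
    class_b = pinned_to_b
    class_a = set(range(n_clients)) - class_b
    return class_a, class_b
```

After any number of rounds the two classes partition the clients: no client is ever in both, which is the bifurcation the slides describe.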
26. Questions to Answer
– How long does it take to reach the bifurcated state, under what assumptions?
– How can the goals of staged upgrade be achieved within the constraints of version awareness?
27. Next Problem
• Operators use scripts to perform actions such as update
• Scripts may fail
  – May be the result of an API failure (more on this later)
  – May be a desire to set up a testing environment
  – May be the result of a failure of the underlying virtual machine
• When a script fails, the operator may wish to return to a known state (undo several operations)
28. Operator Undo
• Not always that straightforward:
  – Attaching a volume is no problem while the instance is running; detaching might be problematic
  – Creating / changing auto-scaling rules has an effect on the number of running instances
    • Cannot terminate additional instances, as the rule would create new ones!
  – Deleted / terminated / released resources are gone!
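One common way to build this kind of undo is to record a compensating action alongside each forward action and replay the compensations in reverse order on rollback. The sketch below is an illustration under that assumption, not NICTA's implementation; the resource and operations are hypothetical. It also shows why irreversible actions are a problem: an action with no valid compensation (e.g. terminating a resource) would make rollback impossible, which motivates tricks like pseudo-delete.

```python
# Sketch of transaction-style operator undo via compensating actions
# (all operations hypothetical). Each do() registers how to reverse
# itself; rollback() replays the reversals in reverse order.

class UndoSession:
    def __init__(self):
        self._compensations = []

    def do(self, action, compensation):
        """Run a forward action and remember its compensation."""
        result = action()
        self._compensations.append(compensation)
        return result

    def rollback(self):
        """Undo everything since the session began, newest first."""
        while self._compensations:
            self._compensations.pop()()

# Hypothetical cloud state: a set of attached volumes.
volumes = set()
session = UndoSession()
session.do(lambda: volumes.add("vol-1"), lambda: volumes.discard("vol-1"))
session.do(lambda: volumes.add("vol-2"), lambda: volumes.discard("vol-2"))
session.rollback()   # volumes is empty again
```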
29. Undo for System Operators
[Diagram: an administrator issues begin-transaction, a series of do operations, and rollback; the interface also offers commit and pseudo-delete]
30. Approach
[Diagram: the administrator's begin-transaction / do / do / do / rollback sequence goes through an undo system, which senses cloud resource states]
31. Approach
[Diagram: as above; the undo system additionally captures the initial state at begin-transaction and the goal state for the rollback]
32. Approach
[Diagram: as above; to roll back, the undo system plans a set of actions from the goal and initial states, generates code, and executes it]
33. What about API Failures?
• Operator scripts make heavy use of checking or controlling the state of resources
  – Start/stop VM
  – Is VM active?
• These scripts become calls to the cloud provider's API.
• Calls may fail
  – Underlying VM has failed
  – Eventual consistency
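A script that treats an API call as infallible will break on exactly these cases, so defensive scripts wrap each call. The sketch below is a generic, hypothetical wrapper (not a real cloud SDK call): it retries on exceptions and treats an over-deadline call as a failure too, since a call that is merely slow or stuck is just as disruptive to a script as one that raises.

```python
# Sketch of a defensive wrapper around a cloud API call (the call
# itself is a hypothetical stand-in; no real SDK is assumed).
import time

class ApiCallFailed(Exception):
    pass

def call_with_retries(call, retries=3, deadline_s=1.0, backoff_s=0.0):
    """Run call(); treat exceptions and deadline overruns as failures
    and retry up to `retries` times before giving up."""
    for _ in range(retries):
        start = time.monotonic()
        try:
            result = call()
        except Exception:
            time.sleep(backoff_s)
            continue                          # call raised: retry
        if time.monotonic() - start > deadline_s:
            time.sleep(backoff_s)
            continue                          # call too slow: retry
        return result
    raise ApiCallFailed("call did not succeed within %d attempts" % retries)
```

Note that retrying is only safe for idempotent calls; a non-idempotent action (e.g. "create instance") needs the caller to check state before retrying.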
34. We Have Performed an Empirical Study of API Failures in EC2
• 922 out of 1109 API-related cases reported in the EC2 forum from 2010 to 2012 are API failures (rather than feature requests or general inquiries).
• We classified the extracted API failures into four types:
  – content failures,
  – late timing failures,
  – halt failures, and
  – erratic failures.
35. Results
• A majority (60%) of the API failure cases are related to stuck or unresponsive API calls.
• A large portion (12%) of the cases are about slowly responding API calls.
• 19% of the cases are related to output issues of API calls, including failed calls with unclear error messages, as well as missing, wrong, and unexpected output.
• 9% of the cases reported that calls were pending for a certain time and then returned to the original state without informing the caller properly, or that calls were first reported as successful but failed later.
36. Next Problem – Operations Processes
• We are looking at the process of installing new software
  – Error prone
  – Potential process improvements
37. Motivating Scenario
• You change the operating environment for an application
  – Configuration change
  – Version change
  – Hardware change
• Result is degraded performance
• When the software stack is deep, with portions from different suppliers, the result is frequently:
38. Why is Installation Error Prone?
• Installation is complicated.
  – Installation guides for SAS 9.3 Intelligence, IBM i, and Oracle 11g for Linux are ~250 pages each
  – Apache's description of addresses and ports (one out of 16 descriptions) has the following elements:
    • Choosing and specifying ports for the server to listen to
    • IPv4 and IPv6
    • Protocols
    • Virtual hosts
  – The number of configuration options that must be set can be large
    • Hadoop has 206 options
    • HBase has 64
  – Many dependencies are not visible until execution
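One recurring failure mode above is a configuration at one level of the stack that is inconsistent with the requirements of another level. A toy check for one such inconsistency (port agreement between adjacent layers) can be sketched as follows; the layer names, keys, and "expects_below" convention are invented for illustration and are not a real tool's schema.

```python
# Toy cross-stack configuration consistency check (hypothetical
# schema). Each layer declares the port it listens on and the port
# it expects the layer directly below it to listen on; mismatches
# are flagged before installation proceeds.

def check_stack(layers):
    """layers: ordered top-to-bottom list of dicts with keys
    'name', 'listen_port', 'expects_below' (None if no expectation)."""
    problems = []
    for upper, lower in zip(layers, layers[1:]):
        expected = upper.get("expects_below")
        actual = lower["listen_port"]
        if expected is not None and expected != actual:
            problems.append(
                "%s expects %s on port %d, but it listens on %d"
                % (upper["name"], lower["name"], expected, actual))
    return problems
```

Real installers face the same check across hundreds of options and several suppliers' formats, which is why dependencies often stay invisible until execution.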
39. Installation Processes
• Processes may be
  – Undocumented
  – Out of date
  – Insufficiently detailed
• Our goal is to build a process model including error recovery mechanisms
40. Our Activities
• Create up-to-date process models for installation processes. Information sources are
  – Process discovery from logs
  – Process formalization from existing written descriptions
• Process descriptions can be used to
  – Make trade-offs
  – Make recommendations in real time to operations staff
  – Recommend setting checkpoints for potential later undo, before a risky part of a process is entered
  – Assist in the detection of errors
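Process discovery from logs typically starts by extracting a directly-follows relation: for each logged case, count how often activity B immediately follows activity A. The sketch below shows that first step only; the log contents are hypothetical, and real process-mining goes well beyond this.

```python
# Minimal "process discovery" sketch: build a directly-follows graph
# from per-case event logs (log contents are hypothetical). The
# resulting edge counts are the starting point of most bottom-up
# process-mining algorithms.
from collections import Counter

def directly_follows(log):
    """log: {case_id: [activity, activity, ...]}
    -> Counter mapping (a, b) to how often b directly follows a."""
    edges = Counter()
    for trace in log.values():
        for a, b in zip(trace, trace[1:]):
            edges[(a, b)] += 1
    return edges
```

Rare edges in this graph are interesting for exactly the reasons on the slide: they often correspond to undocumented exception-handling paths.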
41. Hard Problems
• Creating accurate process models
  – Exception handling mechanisms are not well documented
  – Labor intensive
  – Our approach:
    • Top-down modeling using a process modeling formalism
    • Bottom-up process mining from error logs
• Diagnosing errors
42. Why is Error Diagnosis Hard?
In a distributed computing environment, when an error occurs during operations, it is difficult and time consuming to diagnose it.
Diagnosis involves correlating messages from
• different distributed servers
• different portions of the software stack
and determining the root cause of the error.
The root cause, in turn, may be within a portion of the stack that is different from where the error is observed.
43. Test Bed
Our current test bed is the HBase stack
44. Currently Performing Analysis of Configuration Errors
• Cross-stack errors may take hours to diagnose
  – Log files are inconsistent
  – Error messages may not give the context necessary to determine the root cause
45. Where to Find Information about the Operations Domain?
• Every open source program requires a variety of configuration parameters.
• Every modern application depends on a variety of middleware, so cross-domain examples should be readily available.
• Most organizations have extensive processes for their operations personnel. Use these processes as a framework for investigating process/product interactions.
46. Summary
• Operations problems will account for the majority of outages and IT costs in the next several years.
• The operations space is a rich source of research problems that has been insufficiently mined.
• The best way to determine what problems to attack is to monitor or interview operators.
47. NICTA Team
• Anna Liu
• Alan Fekete
• Min Fu
• Jim Zhanwen Li
• Qinghua Lu
• Hiroshi Wada
• Ingo Weber
• Xiwei Xu
• Liming Zhu