Architectural Tactics for Large Scale Systems

NICTA Copyright 2012 From imagination to impact
Architecture Tactics for
Large-Scale Systems
to Manage Changes
Len Bass

About NICTA
National ICT Australia
• Federal and state funded research
company established in 2002
• Largest ICT research resource in
Australia
• National impact is an important
success metric
• ~700 staff/students working in 5 labs
across major capital cities
• 7 university partners
• Providing R&D services, knowledge
transfer to Australian (and global) ICT
industry
NICTA technology is
in over 1 billion mobile
phones

WICSA 2014 is in Sydney!!
Working IEEE/IFIP Conference on Software
Architecture (WICSA) is the pre-eminent
software architecture conference
April 7-11, 2014

Traditional View of Large Scale Systems
Traditionally, the software engineering community
has viewed systems as being developed for users
and existing in an environment. Under this world view,
the motivating question has been: how can
development costs be reduced and run-time quality
improved?
[Figure labels: Application, Cloud Environment, End users, Developers]

A Broader View
Applications are affected not only by the behavior of the
end users but also by the actions of operators who control
the environment for a consumer’s application.
[Figure labels: Application, Cloud Environment, Consumer, Operator, End users, Developers]

My Message: Applications must
respond to change caused by the
environment and the operators as well
as new processes used during
development.
[Figure labels: Application, Cloud Environment, Consumer, Operator, End users, Developers]

Applications must be aware of
• Failure and its causes
• Consistency issues
• Continuous deployment practices
• Multiple simultaneous versions active
• The remainder of this talk will discuss why
applications should have this kind of awareness
and what tactics are used to address the
problems.

Failure and its causes
A year in the life of a Google data center (from Jeff Dean)
• ~0.5 overheating (power down most machines in <5 mins, ~1-2 days
to recover)
• ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours
to come back)
• ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to
get back)
• ~5 racks go wonky (40-80 machines see 50% packet loss)
• ~12 router reloads (takes out DNS and external VIPs for a couple of
minutes)
• ~3 router failures (have to immediately pull traffic for an hour)
• ~dozens of minor 30-second blips for DNS
• ~1000 individual machine failures
• ~thousands of hard drive failures
• slow disks, bad memory, misconfigured machines, flaky
machines, etc.

Consequence for cloud consumers
• Failure is pervasive.
• Cloud as a whole is reliable (99.5% availability)
but any particular physical component is not.
• This means applications must be aware of the
possibility of virtual machine failure.
• Applications must be constructed to be fault
tolerant.

Detection of fault
• Two techniques
– Heartbeat – component sends periodic messages
indicating that it is alive
– Timeout – client of component sets a deadline after
which
• The component will be assumed to have failed.
• Messages will be assumed to have been lost.
• Netflix (US video streaming service) advocates fast
failure.
– Clients set short timeout.
– Results in better response time if component failed
– May result in “false positive” whereby component is
assumed to have failed but, in reality, is still alive.
– If client retries request, it may be executed twice.

Recovery from fault
• Redundancy of computation and data
– Redundancy of data will be discussed in next section on
consistency
– Redundancy of computation is typically achieved by making
services stateless.
• Can send failed messages to new instance. Need to be
concerned about second execution if first message was, in
fact, acted on
• Can instantiate new copy of service if failure is caused by
overloading.
• Alternative means for accomplishing service
– Some services can be accomplished using different
mechanisms. Consider one mechanism as a fallback to a
primary.
– Degraded service might be possible.
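The sketch below illustrates resending a failed message to another instance while guarding against a second execution, with an optional fallback for degraded service. It is only an illustration under stated assumptions: `Instance`, `call_with_failover`, the request-id scheme, and the shared `store` dictionary (standing in for state kept outside the stateless instances) are hypothetical and not from the talk.

```python
import uuid


class Instance:
    """Hypothetical stateless service instance.

    `store` is a shared dict (request_id -> result) standing in for an
    external store, so any instance can recognise a repeated request.
    """

    def __init__(self, name, store, drop_reply=False):
        self.name = name
        self.store = store
        self.drop_reply = drop_reply   # simulate "acted on, but reply lost"
        self.executions = 0

    def handle(self, request_id, payload):
        if request_id not in self.store:
            self.executions += 1       # the operation runs only once per id
            self.store[request_id] = f"processed {payload}"
        if self.drop_reply:
            raise TimeoutError(f"no reply from {self.name}")
        return self.store[request_id]


def call_with_failover(instances, payload, fallback=None):
    """Try each instance in turn with the same request id, then the fallback."""
    request_id = str(uuid.uuid4())
    for instance in instances:
        try:
            return instance.handle(request_id, payload)
        except TimeoutError:
            continue                   # assume failure, try the next instance
    if fallback is not None:
        return fallback(payload)       # alternative means, possibly degraded
    raise RuntimeError("all instances failed and no fallback is available")
```

Reusing one request id across retries is what keeps the second instance from executing the operation again when the first one had in fact acted on it.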

Undo
• After performing an operation in AWS, you may want to go
back to the original state, i.e., undo the operation
• Not always that straightforward:
– Attaching volume is no problem while the instance is
running, detaching might be problematic
– Creating / changing auto-scaling rules has effect on
number of running instances
• Cannot terminate additional instances, as the rule would
create new ones!
– Deleted / terminated / released resources are gone!

Undo using transaction approach
[Figure labels: Administrator; begin-transaction, do (x3), rollback; + commit; + pseudo-delete]
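One way to read the transaction labels (begin-transaction, do, rollback, commit, pseudo-delete) is sketched below. This is an assumption-laden illustration rather than the system described in the talk: a plain dictionary stands in for cloud resources, and `UndoableSession` and the resource names are hypothetical. The point it shows is that a delete is only recorded until commit, so that rollback can still restore the resource.

```python
class UndoableSession:
    """Hypothetical undo wrapper; a dict stands in for cloud resources."""

    def __init__(self, resources):
        self.resources = resources
        self.undo_log = []             # inverse operations, newest last
        self.pending_deletes = set()   # pseudo-deleted resource names

    def create(self, name, value):
        self.resources[name] = value
        self.undo_log.append(lambda: self.resources.pop(name, None))

    def delete(self, name):
        # Pseudo-delete: hide the resource but keep it until commit.
        self.pending_deletes.add(name)
        self.undo_log.append(lambda: self.pending_deletes.discard(name))

    def visible(self):
        """Resources as the administrator sees them inside the transaction."""
        return {k: v for k, v in self.resources.items()
                if k not in self.pending_deletes}

    def commit(self):
        # Only now are pseudo-deleted resources really removed.
        for name in self.pending_deletes:
            self.resources.pop(name, None)
        self.pending_deletes.clear()
        self.undo_log.clear()

    def rollback(self):
        # Apply inverse operations in reverse order.
        while self.undo_log:
            self.undo_log.pop()()
```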

Approach
[Figure, built up over three slides. Labels: Administrator; Undo System; begin-transaction, do (x3), rollback; Sense cloud resources states; Initial state; Goal state; Plan, Generate code, Execute; Set of actions]

Report fault
• Through logs.
– Correlating logs can be difficult
– Tracking logs to root causes can be very difficult.
• Through reporting to parent service.
– It, in turn, may have alternative means of achieving its
goals, including undo.

Consistency issues
• Data is frequently replicated.
– NoSQL databases all replicate data
• Replication takes time.
– Means that inconsistent versions of data may exist
• One (or more) that has been updated
• One (or more) that has not yet received the updates.
– Leads to phenomenon known as “eventual
consistency”
– May take ½ second to become consistent.

Characterising Eventual Consistency in
Amazon SimpleDB
• Probability of reading updated data in SimpleDB in US West
– An application reads data X ms after it has written the data
• SimpleDB has two read operations
– Eventually Consistent Read
– Consistent Read
• This pattern is consistent regardless of the time of day
[Figure panels: Eventually Consistent Read, Consistent Read]
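The difference between the two kinds of read can be illustrated with a toy model. This is not how SimpleDB is implemented; `ReplicatedStore`, its fixed `replication_delay`, and the explicit `now` argument (logical time in seconds) are assumptions chosen to keep the sketch deterministic.

```python
class ReplicatedStore:
    """Toy primary/replica store with a fixed replication delay."""

    def __init__(self, replication_delay):
        self.delay = replication_delay
        self.primary = {}
        self.replica = {}
        self.in_flight = []            # (apply_at, key, value), oldest first

    def write(self, key, value, now):
        self.primary[key] = value
        self.in_flight.append((now + self.delay, key, value))

    def _apply_due(self, now):
        # Move every update whose delay has elapsed onto the replica.
        due = [u for u in self.in_flight if u[0] <= now]
        self.in_flight = [u for u in self.in_flight if u[0] > now]
        for _, key, value in due:
            self.replica[key] = value

    def read(self, key, now, consistent=False):
        """Consistent reads go to the primary; others may see stale data."""
        self._apply_due(now)
        source = self.primary if consistent else self.replica
        return source.get(key)
```

With a delay of 0.5 seconds, a non-consistent read issued 0.1 seconds after the write misses the update, while a consistent read at the same moment sees it.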

Other types of inconsistency
• Configuration parameters
– All instances should have same settings in terms of
security, locality, etc.
• Synchronization locks. Locks shared across distributed
instances may not be in a consistent state.
• One mechanism is to have a consistency manager.
– Complicated since a centralized consistency manager
may fail and distributed consistency managers must
be coordinated.
– ZooKeeper is an open source tool that manages
consistency for distributed applications at a small cost
in latency.

Continuous deployment practices
• Many organizations have developers deploy
after changes are tested
– Google
– Amazon
– LinkedIn
– Netflix
• Leads to following types of problems
– Multiple simultaneous versions active
– Errors occurring during installation

Various Upgrade Strategies
• How many at once?
– One at a time (rolling upgrade)
– Groups at a time (staged upgrade, e.g. canaries)
– All at once (big flip)
• What happens to old versions?
– Replaced en masse
– Maintained for some period for compatibility purposes
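The "how many at once" choices can be expressed as different batch sizes over the same fleet. The function below is a minimal sketch; the name `upgrade_batches`, the strategy strings, and the `group_size` parameter are assumptions made for illustration.

```python
def upgrade_batches(instances, strategy, group_size=2):
    """Return the batches in which instances would be upgraded.

    strategy: "rolling" (one at a time), "staged" (groups of group_size),
    or "big_flip" (all at once). Hypothetical names for illustration.
    """
    instances = list(instances)
    if strategy == "rolling":
        size = 1
    elif strategy == "staged":
        size = group_size
    elif strategy == "big_flip":
        size = max(len(instances), 1)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [instances[i:i + size] for i in range(0, len(instances), size)]
```

In a staged upgrade the first batch plays the role of the canaries: it is upgraded and observed before the remaining batches follow.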

Services Can be Bundled in Two Fashions
• Tightly Coupled
– Google
– Facebook
• Loosely Coupled
– Amazon
– LinkedIn

Tightly Coupled Services
• Deployment unit is tier
• A tier bundles multiple services into one virtual
machine
• Tier 1
• Tier 2

Loosely Coupled Services
• Deep service dependency hierarchy – may be 70 deep
• Upgrading one service in this hierarchy
• Need to consider both service and its clients
• Each service is a Virtual Machine
Figure from Netflix Tech Blog

Comparing Two Options
• Both options provide for horizontal scaling based
on load
• Both options provide for failure recovery
– Tightly coupled option will replace tier
– Loosely coupled option will replace service
– Failure recovery assumes stateless Virtual Machines
• They differ in
– How updates and canaries are managed (I will
discuss in a moment)
– How unwanted dependencies are avoided
• Tightly coupled option depends on developer discipline
• Loosely coupled option avoids unwanted dependencies
through information hiding.

Common upgrade strategy
• Require all versions to be backward compatible
with previous versions.
• Require changes associated with a new version to
be software switchable.
• Clients of a service must be version aware in
order to know whether to utilize new
functionality.
• Once all instances have been upgraded to the new
version, send a signal to turn on the changes both in
the new version and in its clients.
• When using canaries, only turn on changes for a
subset of services and their clients.
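The strategy above (backward compatibility, software-switchable changes, and a signal once every instance is upgraded) might look roughly like this. It is a sketch under assumptions: `Service`, `NEW_FEATURE_VERSION`, and `switch_on_if_fully_upgraded` are hypothetical names, and a real system would deliver the switch through its own configuration mechanism.

```python
class Service:
    """Hypothetical service instance with new behaviour behind a switch."""

    NEW_FEATURE_VERSION = 2            # assumed version that adds the change

    def __init__(self, version):
        self.version = version
        self.new_features_on = False   # switched off until signalled

    def handle(self, request, client_wants_new=False):
        if (self.version >= self.NEW_FEATURE_VERSION
                and self.new_features_on and client_wants_new):
            return f"new:{request}"
        return f"old:{request}"        # backward-compatible path


def switch_on_if_fully_upgraded(services, target_version):
    """Turn the new behaviour on only when every instance is upgraded."""
    if all(s.version >= target_version for s in services):
        for s in services:
            s.new_features_on = True
        return True
    return False
```

Version-unaware clients never pass `client_wants_new`, so they keep receiving the old behaviour from any instance, before and after the switch.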

Current state of major internet provider
• Each service has an owner
• Every service instance is instrumented
• When a canary is deployed, service owner
examines monitoring data (next slide) and uses
judgment to decide when to move to production.
• Canary testing is currently based on
functionality. No stress testing of canaries.

Netflix Monitoring Sequence
• Client outbound (start/end)
• Network (start/end)
• Service network (inbound start/end)
• Service processing (start/end)
• Service outbound (start/end)
• Network (start/end)
• Client inbound (start/end)
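Given start and end timestamps for each step in the sequence, per-step latency falls out by subtraction. The sketch below assumes timestamps in milliseconds and uses hypothetical stage names; the two "Network" steps are labelled request and response here only to tell them apart.

```python
# Stage names are assumptions that mirror the seven steps listed above.
STAGES = [
    "client outbound",
    "network (request)",
    "service network inbound",
    "service processing",
    "service outbound",
    "network (response)",
    "client inbound",
]


def stage_durations(timestamps):
    """timestamps maps stage -> (start_ms, end_ms); returns stage -> duration."""
    return {
        stage: timestamps[stage][1] - timestamps[stage][0]
        for stage in STAGES
        if stage in timestamps
    }


def slowest_stage(timestamps):
    """Name of the stage that took longest in this request."""
    durations = stage_durations(timestamps)
    return max(durations, key=durations.get)
```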

General picture for version aware loosely coupled
services
[Figure labels: Client; Top-level load balancer; Second-level load balancer (Server for Version A, Server for Version A, Server for Version B); Second-level load balancer (Server for Version A, Server for Version B)]
Client
• Version aware
• Must know about new versions in order to take advantage of
new functionality
• May be implicitly version aware based on, e.g., cluster
• Version-unaware clients will only use old functionality and these can be
served by any server since services are backward compatible.
In addition:
• Load variation may trigger elasticity rules.
• Deciding whether to load the new version or the old version raises other
issues.
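A version-aware load balancer along these lines could be sketched as below. The class `VersionAwareBalancer`, the round-robin policy, and the server names are assumptions for illustration; the talk does not prescribe a routing policy.

```python
class VersionAwareBalancer:
    """Hypothetical load balancer that can route by requested version."""

    def __init__(self, servers):
        self.servers = servers         # list of (server_name, version) pairs
        self._cursors = {}             # wanted_version -> round-robin counter

    def route(self, wanted_version=None):
        """Pick a server; version-unaware clients pass no version."""
        if wanted_version is None:
            pool = self.servers        # any server will do (backward compatible)
        else:
            pool = [s for s in self.servers if s[1] == wanted_version]
        if not pool:
            raise LookupError(f"no server for version {wanted_version!r}")
        i = self._cursors.get(wanted_version, 0)
        self._cursors[wanted_version] = i + 1
        return pool[i % len(pool)][0]
```

Routing version-unaware clients to any server relies on the backward-compatibility requirement: a Version B server must still answer old-style requests correctly.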

Canary Issues
• Canaries are a form of live testing. Put a new
version into limited production to test its
correctness.
• Issues
– How long are new versions tested to determine
correctness?
• Period based – for some period of time
• Load based – under some utilization assumptions
• Result based – until some criterion is met
– How are clients of new version chosen and how is
this choice enforced?
– How are the canaries deployed?
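The talk leaves open how clients of the new version are chosen. One common option, offered here purely as an assumption and not as the speaker's answer, is to hash a stable client identifier into a bucket so that the same clients stay in the canary group as the percentage grows. `in_canary` and the identifier format are hypothetical.

```python
import hashlib


def in_canary(client_id, percent):
    """Deterministically place a client in the canary group.

    The client id is hashed into one of 100 buckets; clients in buckets
    below `percent` are sent to the new version.
    """
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent
```

Because the bucket depends only on the identifier, raising `percent` from 10 to 50 keeps every client that was already in the canary group and adds new ones.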

Use of canaries with tightly coupled services
• Version awareness does not need to extend to
load balancers
– Services and clients are bound into a VM
– Services and clients that are used to test new version
are in single VM and have no need for version aware
load balancers.

More Detail on Upgrade Process
• Canaries are deployed and allowed to run for a
period without turning on new features.
• This is to test backward compatibility.
• Once canaries pass this test, then the new
features are turned on.

Installation Motivating Scenario
• You change the operating environment for an
application
– Configuration change
– Version change
– Hardware change
• Result is degraded performance
• When the software stack is deep with portions
from different suppliers, the result is frequently:

Why is Installation Error Prone?
• Installation is complicated.
– Installation guides for SAS 9.3 Intelligence, IBM i, Oracle 11g for
Linux are ~250 pages each
– Apache description of addresses and ports (one out of 16
descriptions) has following elements:
• Choosing and specifying ports for the server to listen to
• IPv4 and IPv6
• Protocols
• Virtual Hosts
– The number of configuration options that must be set can be
large
• Hadoop has 206 options
• HBase has 64
– Many dependencies are not visible until execution

Installation Processes
• Processes may be
– Undocumented
– Out of date
– Insufficiently detailed
• Our goal is to build a process model that includes
error recovery mechanisms

Our Activities
• Create up-to-date process models for installation
processes. Information sources are
– Process discovery from logs
– Process formalization from existing written
descriptions.
• Process descriptions can be used to
– Make trade-offs
– Make recommendations in real time to operations
staff
– Recommend setting checkpoints for potential later
undo, before a risky part of a process is entered
– Assist in the detection of errors

Hard Problems
• Creating accurate process models
– Exception handling mechanisms are not well
documented
– Noisy logs
– Our approach
• Top down modeling using process modeling formalism
• Bottom up process mining from error logs
• Diagnosing errors

Why is Error Diagnosis Hard?
In a distributed computing
environment, when an error
occurs during operations, it is
difficult and time consuming to
diagnose it.
Diagnosis involves correlating
messages from
• different distributed servers
• different portions of the
software stack
and determining the root
cause of the error.
The root cause, in turn, may
be within a portion of the stack
that is different from where the
error is observed.

Test Bed
Our current test bed is the HBase stack

Currently Performing Analysis of
Configuration Errors
• Cross stack errors may take hours to diagnose
– Log files are inconsistent
– Error messages may not give the context necessary to
determine the root cause.

Summary
• The modern cloud environment and modern
development practices have introduced new
problems or made old problems more important.
• Tactics exist to deal with some of these
problems.
• Developing tactics for other problems is a matter
of research.

NICTA Team
• Anna Liu
• Alan Fekete
• Min Fu
• Jim Zhanwen Li
• Qinghua Lu
• Sherif Sakr
• Hiroshi Wada
• Ingo Weber
• Xiwei Xu
• Liming Zhu
