NICTA Copyright 2012 From imagination to impact
Architecture Tactics for
Large-Scale Systems
to Manage Changes
Len Bass
About NICTA
National ICT Australia
• Federal and state funded research
company established in 2002
• Largest ICT research resource in
Australia
• National impact is an important
success metric
• ~700 staff/students working in 5 labs
across major capital cities
• 7 university partners
• Providing R&D services, knowledge
transfer to Australian (and global) ICT
industry
NICTA technology is
in over 1 billion mobile
phones
WICSA 2014 is in Sydney!!
Working IEEE/IFIP Conference on Software
Architecture (WICSA) is the pre-eminent
software architecture conference
April 7-11, 2014
Traditional View of Large Scale Systems
[Figure: end users and developers interacting with an application that runs in a cloud environment]
Traditionally, the software engineering community
has viewed systems as being developed for users
and existing in an environment. Within this world
view, the motivating question has been: how can
development costs be reduced and run-time quality
improved?
A Broader View
[Figure: application in a cloud environment, with end users, developers, a consumer, and an operator]
Applications are not only affected by the behavior of the
end users but also by actions of operators who control
the environment for a consumer’s application.
My Message: Applications must
respond to change caused by the
environment and the operators as well
as new processes used during
development.
[Figure: application in a cloud environment, with end users, developers, a consumer, and an operator]
Applications must be aware of
• Failure and its causes
• Consistency issues
• Continuous deployment practices
• Multiple simultaneous versions active
• The remainder of this talk will discuss why
applications should have this kind of awareness
and what tactics are used to address the
problems.
Failure and its causes
A year in the life of a Google data center (from Jeff Dean)
• ~0.5 overheating (power down most machines in <5 mins, ~1-2 days
to recover)
• ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours
to come back)
• ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to
get back)
• ~5 racks go wonky (40-80 machines see 50% packet loss)
• ~12 router reloads (takes out DNS and external vips for a couple
minutes)
• ~3 router failures (have to immediately pull traffic for an hour)
• ~dozens of minor 30-second blips for DNS
• ~1000 individual machine failures
• ~thousands of hard drive failures
• slow disks, bad memory, misconfigured machines, flaky
machines, etc.
Consequence for cloud consumers
• Failure is pervasive.
• Cloud as a whole is reliable (99.5% availability)
but any particular physical component is not.
• This means applications must be aware of the
possibility of virtual machine failure.
• Applications must be constructed to be fault
tolerant.
Detection of fault
• Two techniques
– Heartbeat – component sends periodic messages
indicating that it is alive
– Timeout – client of component sets a deadline after
which
• Component will be assumed to have failed, or
• Messages will be assumed to have been lost
• Netflix (US video streaming service) advocates fast
failure.
– Clients set short timeouts.
– Results in better response time if component has failed
– May result in a “false positive” whereby component is
assumed to have failed but, in reality, is still alive.
– If client retries request, it may be executed twice.
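The heartbeat technique and the false-positive risk of short timeouts can be sketched as follows. This is a minimal illustration, not Netflix's implementation; the class name `HeartbeatMonitor`, the component names, and the timeout values are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each component (hypothetical helper)."""
    timeout_s: float
    last_seen: dict = field(default_factory=dict)

    def beat(self, component: str, now: float) -> None:
        # Called whenever a component reports that it is alive.
        self.last_seen[component] = now

    def suspected_failed(self, now: float) -> list:
        # A silent component is only *suspected*: it may be alive but slow.
        return sorted(c for c, t in self.last_seen.items() if now - t > self.timeout_s)


monitor = HeartbeatMonitor(timeout_s=1.0)
monitor.beat("service-a", now=0.0)
monitor.beat("service-b", now=0.0)
monitor.beat("service-a", now=0.9)      # service-b misses its next heartbeat
suspects = monitor.suspected_failed(now=1.5)

# A shorter timeout reacts faster but flags a component that is still alive.
impatient = HeartbeatMonitor(timeout_s=0.5)
impatient.beat("service-a", now=0.9)
false_positive = impatient.suspected_failed(now=1.5)
```

With the longer timeout only `service-b` is suspected; with the shorter one `service-a` is flagged even though it sent a heartbeat 0.6 seconds earlier.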
Recovery from fault
• Redundancy of computation and data
– Redundancy of data will be discussed in next section on
consistency
– Redundancy of computation is typically achieved by making
services stateless.
• Can send failed messages to new instance. Need to be
concerned about second execution if first message was, in
fact, acted on
• Can instantiate new copy of service if failure is caused by
overloading.
• Alternative means for accomplishing service
– Some services can be accomplished using different
mechanisms. Consider one mechanism as a fallback to a
primary.
– Degraded service might be possible.
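One way to handle the "second execution" concern is to tag each request with an id and let the shared data store remember which ids were already acted on. The sketch below assumes exactly that; `SharedStore`, `Instance`, `deposit_with_retry`, and the request ids are hypothetical names, and the in-memory store merely stands in for whatever external storage the stateless instances share.

```python
class SharedStore:
    """Stands in for the external data store that stateless instances share."""

    def __init__(self):
        self.balance = 0
        self.results = {}  # request id -> result of the first execution


class Instance:
    def __init__(self, name, lose_reply=False):
        self.name = name
        self.lose_reply = lose_reply

    def deposit(self, store, request_id, amount):
        if request_id not in store.results:   # act at most once per request id
            store.balance += amount
            store.results[request_id] = store.balance
        if self.lose_reply:                   # request was acted on, reply never arrives
            raise TimeoutError(self.name)
        return store.results[request_id]


def deposit_with_retry(instances, store, request_id, amount, fallback):
    for instance in instances:
        try:
            return instance.deposit(store, request_id, amount)
        except TimeoutError:
            continue                          # assume failure, try another instance
    return fallback()                         # degraded service as a last resort


store = SharedStore()
result = deposit_with_retry(
    [Instance("i-1", lose_reply=True), Instance("i-2")],
    store, request_id="req-42", amount=10,
    fallback=lambda: "queued for later",
)
```

The first instance applies the deposit but its reply is lost; the retry on the second instance returns the recorded result instead of depositing again.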
Undo
• After performing an operation in AWS, an operator may want to go
back to the original state, i.e. undo the operation
• Not always that straight-forward:
– Attaching volume is no problem while the instance is
running, detaching might be problematic
– Creating / changing auto-scaling rules has effect on
number of running instances
• Cannot terminate additional instances, as the rule would
create new ones!
– Deleted / terminated / released resources are gone!
Undo using transaction approach
[Figure: the Administrator issues begin-transaction, a series of do operations, and rollback; the approach adds commit and pseudo-delete]
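A rough sketch of the transaction idea with pseudo-delete: deletions inside a transaction only hide a resource, so rollback can bring it back, and the resource is really released at commit. The class `UndoableResources` and the resource ids are hypothetical, and the in-memory dictionaries stand in for cloud resources; no real AWS API is used.

```python
class UndoableResources:
    """In-memory stand-in for a set of cloud resources (not an AWS API)."""

    def __init__(self, resources):
        self.live = dict(resources)   # resource id -> description
        self.pending_delete = {}      # pseudo-deleted, still recoverable
        self.created = set()          # resources created in this transaction

    def begin_transaction(self):
        self.pending_delete = {}
        self.created = set()

    def create(self, rid, desc):
        self.live[rid] = desc
        self.created.add(rid)

    def delete(self, rid):
        desc = self.live.pop(rid)
        if rid in self.created:
            self.created.discard(rid)         # created and deleted in the same transaction
        else:
            self.pending_delete[rid] = desc   # pseudo-delete: keep until commit

    def commit(self):
        self.pending_delete.clear()           # only now are resources really gone
        self.created.clear()

    def rollback(self):
        for rid in self.created:
            self.live.pop(rid, None)          # remove what the transaction added
        self.live.update(self.pending_delete) # restore what it pseudo-deleted
        self.pending_delete = {}
        self.created = set()


cloud = UndoableResources({"vol-1": "volume", "i-1": "instance"})
cloud.begin_transaction()
cloud.delete("vol-1")
cloud.create("i-2", "instance")
cloud.rollback()
```

After the rollback, `cloud.live` holds the same two resources as before the transaction.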
Approach
[Figure, built up over three slides: the Administrator sends begin-transaction, do, and rollback to the Undo System, which senses cloud resource states. Later builds add an initial state, a goal state, a set of actions, and the steps Plan, Generate code, Execute.]
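The planning step can be illustrated with a deliberately simple state diff. This is not the planner used by the Undo System; it only shows how an initial state and a goal state can yield a set of actions. It assumes that, at rollback time, the currently sensed state is the initial state and the state sensed at begin-transaction is the goal state. The function `plan_undo` and the resource ids are hypothetical.

```python
def plan_undo(initial_state, goal_state):
    """Return actions that move resources from initial_state to goal_state.

    Both arguments map a resource id to its configuration. A plain state
    diff is used here in place of a real planner.
    """
    actions = []
    for rid in sorted(initial_state.keys() - goal_state.keys()):
        actions.append(("delete", rid))
    for rid in sorted(goal_state.keys() - initial_state.keys()):
        actions.append(("create", rid, goal_state[rid]))
    for rid in sorted(initial_state.keys() & goal_state.keys()):
        if initial_state[rid] != goal_state[rid]:
            actions.append(("update", rid, goal_state[rid]))
    return actions


# State sensed at begin-transaction (assumed to be the goal of the undo).
checkpoint = {"i-1": {"type": "small"}, "vol-1": {"attached_to": "i-1"}}
# State sensed when rollback is requested (assumed to be the planner's initial state).
current = {"i-1": {"type": "large"}, "i-2": {"type": "small"}}

actions = plan_undo(initial_state=current, goal_state=checkpoint)
```

A diff like this ignores ordering constraints between actions and cannot truly recreate a released resource, which is part of why planning is needed in practice.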
Report fault
• Through logs.
– Correlating logs can be difficult
– Tracking logs to root causes can be very difficult.
• Through reporting to parent service.
– It, in turn, may have alternative means of achieving its
goals, including undo.
Consistency issues
• Data is frequently replicated.
– NoSQL databases typically replicate data
• Replication takes time.
– Means that inconsistent versions of data may exist
• One (or more) that has been updated
• One (or more) that has not yet received the updates.
– Leads to phenomenon known as “eventual
consistency”
– May take ½ second to become consistent.
Characterising Eventual Consistency in
Amazon SimpleDB
• Probability of reading updated data from SimpleDB in US West
– An application reads data X (ms) after it has written data
• SimpleDB has two read operations
– Eventual Consistent Read
– Consistent Read
• This pattern is consistent regardless of the time of day
[Figure: panels for Eventual Consistent Read and Consistent Read]
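The difference between the two kinds of read can be shown with a toy model: one primary copy and one replica that receives updates after a fixed lag. This is not how SimpleDB is implemented; `ReplicatedStore`, the 500 ms lag, and the explicit `now_ms` clock are assumptions chosen to keep the example deterministic.

```python
class ReplicatedStore:
    """Toy store with one primary and one replica that lags by lag_ms."""

    def __init__(self, lag_ms):
        self.lag_ms = lag_ms
        self.primary = {}
        self.replica = {}
        self.pending = []  # (apply_at_ms, key, value) updates still in flight

    def write(self, key, value, now_ms):
        self.primary[key] = value
        self.pending.append((now_ms + self.lag_ms, key, value))

    def _replicate(self, now_ms):
        # Apply every update whose replication delay has elapsed, in write order.
        due = [u for u in self.pending if u[0] <= now_ms]
        self.pending = [u for u in self.pending if u[0] > now_ms]
        for _, key, value in due:
            self.replica[key] = value

    def consistent_read(self, key, now_ms):
        return self.primary.get(key)       # always sees the latest write

    def eventual_read(self, key, now_ms):
        self._replicate(now_ms)
        return self.replica.get(key)       # may return a stale value


store = ReplicatedStore(lag_ms=500)
store.write("x", "v1", now_ms=0)
store.write("x", "v2", now_ms=1000)
stale = store.eventual_read("x", now_ms=1100)    # replica has not seen v2 yet
fresh = store.consistent_read("x", now_ms=1100)
later = store.eventual_read("x", now_ms=1600)    # replica has caught up
```

Reading 100 ms after the second write returns the old value from the replica, while the consistent read already returns the new one.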
Other types of inconsistency
• Configuration parameters
– All instances should have same settings in terms of
security, locality, etc.
• Synchronization locks. Locks shared across distributed
instances may not be in a consistent state.
• One mechanism is to have a consistency manager.
– Complicated since a centralized consistency manager
may fail and distributed consistency managers must
be coordinated.
– ZooKeeper is an open source tool that manages
consistency for distributed applications at a small cost
in latency.
Continuous deployment practices
• Many organizations have developers deploy
once their changes have been tested
– Google
– Amazon
– LinkedIn
– Netflix
• Leads to the following types of problems
– Multiple simultaneous versions active
– Errors occurring during installation
Various Upgrade Strategies
• How many at once?
– One at a time (rolling upgrade)
– Groups at a time (staged upgrade, e.g. canaries)
– All at once (big flip)
• What happens to old versions?
– Replaced en masse
– Maintained for some period for compatibility purposes
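The three "how many at once" choices differ only in batch size, which a short sketch makes concrete. The function `upgrade_batches`, the strategy names used as strings, and the instance ids are hypothetical.

```python
def upgrade_batches(instances, strategy, group_size=2):
    """Split instances into the batches that are upgraded together."""
    if strategy == "rolling":        # one at a time
        size = 1
    elif strategy == "staged":       # groups at a time
        size = group_size
    elif strategy == "big_flip":     # all at once
        size = max(len(instances), 1)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [instances[i:i + size] for i in range(0, len(instances), size)]


fleet = ["vm-1", "vm-2", "vm-3", "vm-4", "vm-5"]
rolling = upgrade_batches(fleet, "rolling")
staged = upgrade_batches(fleet, "staged", group_size=2)
big_flip = upgrade_batches(fleet, "big_flip")
```

With rolling and staged upgrades, old and new versions are active at the same time between batches; with a big flip there is a single batch.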
Services Can be Bundled in Two Fashions
• Tightly Coupled
– Google
– Facebook
• Loosely Coupled
– Amazon
– LinkedIn
Tightly Coupled Services
• Deployment unit is a tier
• A tier bundles multiple services into one virtual
machine
[Figure: Tier 1 and Tier 2]
Loosely Coupled Services
• Deep service dependency hierarchy – may be 70 deep
• Upgrading one service in this hierarchy means considering
both the service and its clients
• Each service is a Virtual Machine
Figure from Netflix Tech Blog
Comparing Two Options
• Both options provide for horizontal scaling based
on load
• Both options provide for failure recovery
– Tightly coupled option will replace tier
– Loosely coupled option will replace service
– Failure recovery assumes stateless Virtual Machines
• The options differ in
– How updates and canaries are managed (I will
discuss in a moment)
– How unwanted dependencies are avoided
• Tightly coupled option depends on developer discipline
• Loosely coupled option avoids unwanted dependencies
through information hiding.
Common upgrade strategy
• Require all versions to be backward compatible
with previous versions
• Require changes associated with new version to
be software switchable.
• Clients of a service must be version aware in
order to know whether to utilize new
functionality.
• Once all instances have been upgraded to new
versions, send signal to turn on changes both in
the new version and their clients.
• When using canaries, only turn on changes for a
subset of services and their clients.
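A minimal sketch of "software switchable" changes: the new behaviour ships switched off, old behaviour stays available for backward compatibility, and the switch is flipped only after every instance runs the new version. The `Service` class, the `want_new` flag, and the greeting strings are hypothetical.

```python
class Service:
    """A service instance whose new behaviour sits behind a software switch."""

    def __init__(self, version):
        self.version = version
        self.new_feature_on = False   # new code ships switched off

    def greet(self, name, want_new=False):
        # Version-aware clients pass want_new=True to ask for new functionality.
        if want_new and self.version >= 2 and self.new_feature_on:
            return f"Hello, {name}! (v2 format)"
        return f"Hello {name}"        # old behaviour, kept for backward compatibility


def switch_on_if_fully_upgraded(instances, target_version):
    """Send the 'turn on' signal only once every instance runs the new version."""
    if all(s.version >= target_version for s in instances):
        for s in instances:
            s.new_feature_on = True
        return True
    return False


fleet = [Service(version=2), Service(version=1)]
mid_upgrade = switch_on_if_fully_upgraded(fleet, target_version=2)
fleet[1] = Service(version=2)         # the last instance is upgraded
done = switch_on_if_fully_upgraded(fleet, target_version=2)
new_reply = fleet[0].greet("Ada", want_new=True)
old_reply = fleet[0].greet("Ada")
```

For a canary, the same switch would be turned on for only a subset of instances and the clients routed to them.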
Current state of major internet provider
• Each service has an owner
• Every service instance is instrumented
• When a canary is deployed, service owner
examines monitoring data (next slide) and uses
judgment to decide when to move to production.
• Canary testing is currently based on
functionality. No stress testing of canaries.
Netflix Monitoring Sequence
• Client outbound (start/end)
• Network (start/end)
• Service network (inbound start/end)
• Service processing (start/end)
• Service outbound (start/end)
• Network (start/end)
• Client inbound (start/end)
General picture for version aware loosely coupled
services
[Figure: a client calls a top level load balancer, which forwards to second level load balancers; each second level load balancer fronts a mix of servers for Version A and Version B]
Client
• Version aware clients
– Must know about new versions in order to take advantage of
new functionality
– May be implicitly version aware
based on, e.g., cluster
• Version unaware clients will only use
old functionality and these can be
served by any server since services
are backward compatible.
In addition:
• Load variation may
trigger elasticity rules.
• Deciding whether to
load new version or old
version raises other
issues.
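The routing rule described above can be sketched with a single load balancer: version-aware clients name the version they need, while version-unaware clients may land on any server. `VersionAwareBalancer`, the `wants_version` parameter, and the server names are hypothetical; the round-robin choice is an assumption, not something the slide specifies.

```python
import itertools


class VersionAwareBalancer:
    def __init__(self, servers):
        # servers: list of (name, version) pairs, e.g. ("s1", "A")
        self.servers = servers
        self._counter = itertools.count()

    def route(self, wants_version=None):
        if wants_version is None:
            # Version-unaware client: any server will do, since services
            # are backward compatible.
            candidates = self.servers
        else:
            candidates = [s for s in self.servers if s[1] == wants_version]
        if not candidates:
            raise LookupError(f"no server for version {wants_version}")
        # Simple round robin over the eligible servers.
        return candidates[next(self._counter) % len(candidates)]


balancer = VersionAwareBalancer([("s1", "A"), ("s2", "A"), ("s3", "B")])
new_client_target = balancer.route(wants_version="B")
old_client_targets = [balancer.route()[0] for _ in range(3)]
```

The version-aware request always reaches the Version B server, while the three version-unaware requests are spread over all three servers.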
Canary Issues
• Canaries are a form of live testing. Put a new
version into limited production to test its
correctness.
• Issues
– How long are new versions tested to determine
correctness?
• Period based – for some period of time
• Load based – under some utilization assumptions
• Result based – until some criterion is met
– How are clients of new version chosen and how is
this choice enforced?
– How are the canaries deployed?
Use of canaries with tightly coupled services
• Version awareness does not need to extend to
load balancers
– Services and clients are bundled into a VM
– Services and clients that are used to test new version
are in a single VM and have no need for version aware
load balancers.
More Detail on Upgrade Process
• Canaries are deployed and allowed to run for a
period without turning on new features.
• This is to test backward compatibility.
• Once canaries pass this test, then the new
features are turned on.
Installation Motivating Scenario
• You change the operating environment for an
application
– Configuration change
– Version change
– Hardware change
• Result is degraded performance
• When the software stack is deep with portions
from different suppliers, the result is frequently:
Why is Installation Error Prone?
• Installation is complicated.
– Installation guides for SAS 9.3 Intelligence, IBM i, Oracle 11g for
Linux are ~250 pages each
– Apache description of addresses and ports (one out of 16
descriptions) has following elements:
• Choosing and specifying ports for the server to listen to
• IPv4 and IPv6
• Protocols
• Virtual Hosts
– The number of configuration options that must be set can be
large
• Hadoop has 206 options
• HBase has 64
– Many dependencies are not visible until execution
Installation Processes
• Processes may be
– Undocumented
– Out of date
– Insufficiently detailed
• Our goal is to build process model including
error recovery mechanisms
Our Activities
• Create up to date process models for installation
processes. Information sources are
– Process discovery from logs
– Process formalization from existing written
descriptions.
• Process descriptions can be used to
– Make trade offs
– Make recommendations in real time to operations
staff
– Recommend setting checkpoints for potential later
undo, before a risky part of a process is entered
– Assist in the detection of errors
Hard Problems
• Creating accurate process models
– Exception handling mechanisms are not well
documented
– Noisy logs
– Our approach
• Top down modeling using process modeling formalism
• Bottom up process mining from error logs
• Diagnosing errors
Why is Error Diagnosis Hard?
In a distributed computing
environment, when an error
occurs during operations, it is
difficult and time consuming to
diagnose it.
Diagnosis involves correlating
messages from
• different distributed servers
• different portions of the
software stack
and determining the root
cause of the error.
The root cause, in turn, may
be within a portion of the stack
that is different from where the
error is observed.
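Correlation across servers and stack layers can be sketched as grouping log records by request and ordering them in time. The log format, the request ids, and the layer names below are invented for the example; real logs are rarely this uniform, and the sketch assumes clocks are synchronized across servers.

```python
from collections import defaultdict


def correlate(log_lines):
    """Group log records by request id and order each group by timestamp.

    Assumes every record looks like '<ts> <server> <layer> <request_id> <message>'.
    """
    by_request = defaultdict(list)
    for line in log_lines:
        ts, server, layer, request_id, message = line.split(" ", 4)
        by_request[request_id].append((float(ts), server, layer, message))
    return {rid: sorted(records) for rid, records in by_request.items()}


def first_error(records):
    """Return the earliest error record, a candidate for the root cause."""
    for record in records:
        if "ERROR" in record[3]:
            return record
    return None


logs = [
    "3.0 web-1 app r7 ERROR request failed",
    "1.0 db-2 hdfs r7 ERROR disk full",
    "2.0 db-1 hbase r7 ERROR write rejected",
    "1.5 web-1 app r8 INFO ok",
]
timeline = correlate(logs)
candidate = first_error(timeline["r7"])
```

The error is observed in the application layer on `web-1`, but the earliest related error comes from a different server and a different layer of the stack.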
Test Bed
Our current test bed is the HBase stack
Currently Performing Analysis of
Configuration Errors
• Cross stack errors may take hours to diagnose
– Log files are inconsistent
– Error message may not give context necessary to
determine root cause.
Summary
• The modern cloud environment and modern
development practices have introduced new
problems or made old problems more important.
• Tactics exist to deal with some of these
problems.
• Developing tactics for other problems is a matter
of research.
NICTA Team
• Anna Liu
• Alan Fekete
• Min Fu
• Jim Zhanwen Li
• Qinghua Lu
• Sherif Sakr
• Hiroshi Wada
• Ingo Weber
• Xiwei Xu
• Liming Zhu
43

More Related Content

What's hot

Network Troubleshooting - Part 1
Network Troubleshooting - Part 1Network Troubleshooting - Part 1
Network Troubleshooting - Part 1
SolarWinds
 
SolarWinds Product Management Technical Drilldown on Deep Packet Inspection a...
SolarWinds Product Management Technical Drilldown on Deep Packet Inspection a...SolarWinds Product Management Technical Drilldown on Deep Packet Inspection a...
SolarWinds Product Management Technical Drilldown on Deep Packet Inspection a...
SolarWinds
 
Your Applications Are Distributed, How About Your Network Analysis Solution?
Your Applications Are Distributed, How About Your Network Analysis Solution?Your Applications Are Distributed, How About Your Network Analysis Solution?
Your Applications Are Distributed, How About Your Network Analysis Solution?
Savvius, Inc
 
Government and Education: IT Tools to Support Your Hybrid Workforce
Government and Education: IT Tools to Support Your Hybrid WorkforceGovernment and Education: IT Tools to Support Your Hybrid Workforce
Government and Education: IT Tools to Support Your Hybrid Workforce
SolarWinds
 
Branch Office Infrastructure
Branch Office InfrastructureBranch Office Infrastructure
Branch Office Infrastructure
Aidan Finn
 
Infrastructure Migration from Windows Server 2003 to the Cloud: An Interoute ...
Infrastructure Migration from Windows Server 2003 to the Cloud: An Interoute ...Infrastructure Migration from Windows Server 2003 to the Cloud: An Interoute ...
Infrastructure Migration from Windows Server 2003 to the Cloud: An Interoute ...
Interoute
 
StruxureWare DCIM
StruxureWare DCIMStruxureWare DCIM
StruxureWare DCIM
Rogier den Boer
 
NER & Emerson Infrastructure Optimization Capabilties Storyboard
NER & Emerson   Infrastructure Optimization Capabilties StoryboardNER & Emerson   Infrastructure Optimization Capabilties Storyboard
NER & Emerson Infrastructure Optimization Capabilties Storyboard
Greg Stover
 
DevCon13 System Administration Basics
DevCon13 System Administration BasicsDevCon13 System Administration Basics
DevCon13 System Administration Basicssysnickm
 
Earthlink introduction and its overview eb 01-16-04
Earthlink introduction and its overview   eb  01-16-04 Earthlink introduction and its overview   eb  01-16-04
Earthlink introduction and its overview eb 01-16-04
E B
 
Datacenter best practices design and implementation
Datacenter best practices design and implementationDatacenter best practices design and implementation
Datacenter best practices design and implementation
Anton An
 
ZD&T Survival Kit
ZD&T Survival KitZD&T Survival Kit
ZD&T Survival Kit
Michael Erichsen
 
Cloud computing
Cloud computingCloud computing
Cloud computing
Rohith Shankar
 
Cloud Engineering
Cloud EngineeringCloud Engineering
Cloud Engineering
Gwendal Simon
 
The Economics of Scale: Promises and Perils of Going Distributed
The Economics of Scale: Promises and Perils of Going DistributedThe Economics of Scale: Promises and Perils of Going Distributed
The Economics of Scale: Promises and Perils of Going Distributed
Tyler Treat
 
Sioux Hot-or-Not: The future of Linux (Alan Cox)
Sioux Hot-or-Not: The future of Linux (Alan Cox)Sioux Hot-or-Not: The future of Linux (Alan Cox)
Sioux Hot-or-Not: The future of Linux (Alan Cox)
siouxhotornot
 
Government and Education Webinar: Recovering IP Addresses on Your Network
Government and Education Webinar: Recovering IP Addresses on Your NetworkGovernment and Education Webinar: Recovering IP Addresses on Your Network
Government and Education Webinar: Recovering IP Addresses on Your Network
SolarWinds
 
Neuralstar- Network Management System
Neuralstar- Network Management SystemNeuralstar- Network Management System
Neuralstar- Network Management System
Manish Jha
 
Introductorytocomputing
IntroductorytocomputingIntroductorytocomputing
Introductorytocomputing
Anne Starr
 

What's hot (20)

Network Troubleshooting - Part 1
Network Troubleshooting - Part 1Network Troubleshooting - Part 1
Network Troubleshooting - Part 1
 
SolarWinds Product Management Technical Drilldown on Deep Packet Inspection a...
SolarWinds Product Management Technical Drilldown on Deep Packet Inspection a...SolarWinds Product Management Technical Drilldown on Deep Packet Inspection a...
SolarWinds Product Management Technical Drilldown on Deep Packet Inspection a...
 
Your Applications Are Distributed, How About Your Network Analysis Solution?
Your Applications Are Distributed, How About Your Network Analysis Solution?Your Applications Are Distributed, How About Your Network Analysis Solution?
Your Applications Are Distributed, How About Your Network Analysis Solution?
 
Government and Education: IT Tools to Support Your Hybrid Workforce
Government and Education: IT Tools to Support Your Hybrid WorkforceGovernment and Education: IT Tools to Support Your Hybrid Workforce
Government and Education: IT Tools to Support Your Hybrid Workforce
 
Branch Office Infrastructure
Branch Office InfrastructureBranch Office Infrastructure
Branch Office Infrastructure
 
Infrastructure Migration from Windows Server 2003 to the Cloud: An Interoute ...
Infrastructure Migration from Windows Server 2003 to the Cloud: An Interoute ...Infrastructure Migration from Windows Server 2003 to the Cloud: An Interoute ...
Infrastructure Migration from Windows Server 2003 to the Cloud: An Interoute ...
 
StruxureWare DCIM
StruxureWare DCIMStruxureWare DCIM
StruxureWare DCIM
 
NER & Emerson Infrastructure Optimization Capabilties Storyboard
NER & Emerson   Infrastructure Optimization Capabilties StoryboardNER & Emerson   Infrastructure Optimization Capabilties Storyboard
NER & Emerson Infrastructure Optimization Capabilties Storyboard
 
DevCon13 System Administration Basics
DevCon13 System Administration BasicsDevCon13 System Administration Basics
DevCon13 System Administration Basics
 
Earthlink introduction and its overview eb 01-16-04
Earthlink introduction and its overview   eb  01-16-04 Earthlink introduction and its overview   eb  01-16-04
Earthlink introduction and its overview eb 01-16-04
 
Datacenter best practices design and implementation
Datacenter best practices design and implementationDatacenter best practices design and implementation
Datacenter best practices design and implementation
 
Datacenter
DatacenterDatacenter
Datacenter
 
ZD&T Survival Kit
ZD&T Survival KitZD&T Survival Kit
ZD&T Survival Kit
 
Cloud computing
Cloud computingCloud computing
Cloud computing
 
Cloud Engineering
Cloud EngineeringCloud Engineering
Cloud Engineering
 
The Economics of Scale: Promises and Perils of Going Distributed
The Economics of Scale: Promises and Perils of Going DistributedThe Economics of Scale: Promises and Perils of Going Distributed
The Economics of Scale: Promises and Perils of Going Distributed
 
Sioux Hot-or-Not: The future of Linux (Alan Cox)
Sioux Hot-or-Not: The future of Linux (Alan Cox)Sioux Hot-or-Not: The future of Linux (Alan Cox)
Sioux Hot-or-Not: The future of Linux (Alan Cox)
 
Government and Education Webinar: Recovering IP Addresses on Your Network
Government and Education Webinar: Recovering IP Addresses on Your NetworkGovernment and Education Webinar: Recovering IP Addresses on Your Network
Government and Education Webinar: Recovering IP Addresses on Your Network
 
Neuralstar- Network Management System
Neuralstar- Network Management SystemNeuralstar- Network Management System
Neuralstar- Network Management System
 
Introductorytocomputing
IntroductorytocomputingIntroductorytocomputing
Introductorytocomputing
 

Viewers also liked

Internet of Things: Patterns For Building Real World Applications
Internet of Things: Patterns For Building Real World ApplicationsInternet of Things: Patterns For Building Real World Applications
Internet of Things: Patterns For Building Real World Applications
Ivan Dwyer
 
Practices of Good Software Architects
Practices of Good Software ArchitectsPractices of Good Software Architects
Practices of Good Software Architects
Eberhard Wolff
 
SOLID Principles and Design Patterns
SOLID Principles and Design PatternsSOLID Principles and Design Patterns
SOLID Principles and Design Patterns
Ganesh Samarthyam
 
Layered architecture style
Layered architecture styleLayered architecture style
Layered architecture style
Begench Suhanov
 
A Software Architect's View On Diagramming
A Software Architect's View On DiagrammingA Software Architect's View On Diagramming
A Software Architect's View On Diagramming
meghantaylor
 
Design principles and elements
Design principles and elementsDesign principles and elements
Design principles and elementsSimphiwe Dumengane
 
Architectural Patterns and Software Architectures: Client-Server, Multi-Tier,...
Architectural Patterns and Software Architectures: Client-Server, Multi-Tier,...Architectural Patterns and Software Architectures: Client-Server, Multi-Tier,...
Architectural Patterns and Software Architectures: Client-Server, Multi-Tier,...
Svetlin Nakov
 
Layered Software Architecture
Layered Software ArchitectureLayered Software Architecture
Layered Software Architecture
Lars-Erik Kindblad
 
Principles of microservices velocity
Principles of microservices   velocityPrinciples of microservices   velocity
Principles of microservices velocity
Sam Newman
 

Viewers also liked (10)

Design principles
Design principlesDesign principles
Design principles
 
Internet of Things: Patterns For Building Real World Applications
Internet of Things: Patterns For Building Real World ApplicationsInternet of Things: Patterns For Building Real World Applications
Internet of Things: Patterns For Building Real World Applications
 
Practices of Good Software Architects
Practices of Good Software ArchitectsPractices of Good Software Architects
Practices of Good Software Architects
 
SOLID Principles and Design Patterns
SOLID Principles and Design PatternsSOLID Principles and Design Patterns
SOLID Principles and Design Patterns
 
Layered architecture style
Layered architecture styleLayered architecture style
Layered architecture style
 
A Software Architect's View On Diagramming
A Software Architect's View On DiagrammingA Software Architect's View On Diagramming
A Software Architect's View On Diagramming
 
Design principles and elements
Design principles and elementsDesign principles and elements
Design principles and elements
 
Architectural Patterns and Software Architectures: Client-Server, Multi-Tier,...
Architectural Patterns and Software Architectures: Client-Server, Multi-Tier,...Architectural Patterns and Software Architectures: Client-Server, Multi-Tier,...
Architectural Patterns and Software Architectures: Client-Server, Multi-Tier,...
 
Layered Software Architecture
Layered Software ArchitectureLayered Software Architecture
Layered Software Architecture
 
Principles of microservices velocity
Principles of microservices   velocityPrinciples of microservices   velocity
Principles of microservices velocity
 

Similar to Architectural Tactics for Large Scale Systems

Deployability
DeployabilityDeployability
Deployability
Len Bass
 
Technology insights: Decision Science Platform
Technology insights: Decision Science PlatformTechnology insights: Decision Science Platform
Technology insights: Decision Science Platform
Decision Science Community
 
Patterns of enterprise application architecture
Patterns of enterprise application architecturePatterns of enterprise application architecture
Patterns of enterprise application architecture
Chinh Ngo Nguyen
 
E2 evc 3-2-1-rule - mikeresseler
E2 evc   3-2-1-rule - mikeresselerE2 evc   3-2-1-rule - mikeresseler
E2 evc 3-2-1-rule - mikeresseler
Mike Resseler
 
Designing Scalable Applications
Designing Scalable ApplicationsDesigning Scalable Applications
Designing Scalable Applications
Fabricio Epaminondas
 
Dori Exterman, Considerations for choosing the parallel computing strategy th...
Dori Exterman, Considerations for choosing the parallel computing strategy th...Dori Exterman, Considerations for choosing the parallel computing strategy th...
Dori Exterman, Considerations for choosing the parallel computing strategy th...
Sergey Platonov
 
Architecting for the cloud elasticity security
Architecting for the cloud elasticity securityArchitecting for the cloud elasticity security
Architecting for the cloud elasticity security
Len Bass
 
SDN Demystified, by Dean Pemberton [APNIC 38]
SDN Demystified, by Dean Pemberton [APNIC 38]SDN Demystified, by Dean Pemberton [APNIC 38]
SDN Demystified, by Dean Pemberton [APNIC 38]
APNIC
 
Visualizing Your Network Health - Driving Visibility in Increasingly Complex...
Visualizing Your Network Health -  Driving Visibility in Increasingly Complex...Visualizing Your Network Health -  Driving Visibility in Increasingly Complex...
Visualizing Your Network Health - Driving Visibility in Increasingly Complex...
DellNMS
 
Cloud-Native Patterns and the Benefits of MySQL as a Platform Managed Service
Cloud-Native Patterns and the Benefits of MySQL as a Platform Managed ServiceCloud-Native Patterns and the Benefits of MySQL as a Platform Managed Service
Cloud-Native Patterns and the Benefits of MySQL as a Platform Managed Service
VMware Tanzu
 
Visualizing Your Network Health - Know your Network
Visualizing Your Network Health - Know your NetworkVisualizing Your Network Health - Know your Network
Visualizing Your Network Health - Know your Network
DellNMS
 
Deploying at will - SEI
 Deploying at will - SEI Deploying at will - SEI
Software Defined Networking: Network Virtualization
Software Defined Networking: Network VirtualizationSoftware Defined Networking: Network Virtualization
Software Defined Networking: Network Virtualization
NetCraftsmen
 
Tiger oracle
Tiger oracleTiger oracle
Tiger oracled0nn9n
 
10 Steps to Architecting a Sustainable SCADA System
10 Steps to Architecting a Sustainable SCADA System10 Steps to Architecting a Sustainable SCADA System
10 Steps to Architecting a Sustainable SCADA System
Inductive Automation
 
Lecture 6 cloud
Lecture 6   cloudLecture 6   cloud
Lecture 6 cloud
Naomi Unkelos-Shpigel
 
Sample Solution Blueprint
Sample Solution BlueprintSample Solution Blueprint
Sample Solution BlueprintMike Alvarado
 
10 Steps to Architecting a Sustainable SCADA System
10 Steps to Architecting a Sustainable SCADA System10 Steps to Architecting a Sustainable SCADA System
10 Steps to Architecting a Sustainable SCADA System
Inductive Automation
 
Cloud Computing
Cloud ComputingCloud Computing
Cloud Computing
Biswajit Pratihari
 
Cloud Computing and Virtualization Overview by Amr Ali
Cloud Computing and Virtualization Overview by Amr AliCloud Computing and Virtualization Overview by Amr Ali
Cloud Computing and Virtualization Overview by Amr Ali
Amr Ali
 

Architectural Tactics for Large Scale Systems

  • 1. NICTA Copyright 2012 From imagination to impact Architecture Tactics for Large-Scale Systems to Manage Changes Len Bass
  • 2. About NICTA: National ICT Australia
      • Federal and state funded research company established in 2002
      • Largest ICT research resource in Australia
      • National impact is an important success metric
      • ~700 staff/students working in 5 labs across major capital cities
      • 7 university partners
      • Providing R&D services, knowledge transfer to Australian (and global) ICT industry
      NICTA technology is in over 1 billion mobile phones
  • 3. WICSA 2014 is in Sydney!! The Working IEEE/IFIP Conference on Software Architecture (WICSA) is the pre-eminent software architecture conference. April 7-11, 2014
  • 4. Traditional View of Large Scale Systems
      Traditionally, the software engineering community has viewed systems as being developed for users and existing in an environment. With this world view, the motivating question has been: how can development costs be reduced and run-time quality improved?
      [Diagram labels: Application, Cloud Environment, End users, Developers]
  • 5. A Broader View
      Applications are not only affected by the behavior of the end users but also by actions of operators who control the environment for a consumer’s application.
      [Diagram labels: Application, Cloud Environment, Consumer, Operator, End users, Developers]
  • 6. My Message: Applications must respond to change caused by the environment and the operators as well as new processes used during development.
      [Diagram labels: Application, Cloud Environment, Consumer, Operator, End users, Developers]
  • 7. Applications must be aware of
      • Failure and its causes
      • Consistency issues
      • Continuous deployment practices
      • Multiple simultaneous versions active
      The remainder of this talk will discuss why applications should have this kind of awareness and what tactics are used to address the problems.
  • 8. Failure and its causes
      A year in the life of a Google data center (from Jeff Dean)
      • ~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)
      • ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
      • ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
      • ~5 racks go wonky (40-80 machines see 50% packetloss)
      • ~12 router reloads (takes out DNS and external vips for a couple minutes)
      • ~3 router failures (have to immediately pull traffic for an hour)
      • ~dozens of minor 30-second blips for dns
      • ~1000 individual machine failures
      • ~thousands of hard drive failures
      • slow disks, bad memory, misconfigured machines, flaky machines, etc.
  • 9. Consequence for cloud consumers
      • Failure is pervasive.
      • Cloud as a whole is reliable (99.5% availability) but any particular physical component is not.
      • This means applications must be aware of the possibility of virtual machine failure.
      • Applications must be constructed to be fault tolerant.
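To put the 99.5% figure in perspective, the short calculation below converts an availability percentage into expected downtime per year. It is only an arithmetic illustration: the function name and the 365-day year are assumptions, and the slide does not say over what period the 99.5% is measured.

```python
def downtime_hours_per_year(availability, hours_per_year=365 * 24):
    """Expected unavailable hours per year for a given availability fraction.

    Assumes a 365-day year; `availability` is a fraction such as 0.995.
    """
    return (1.0 - availability) * hours_per_year


if __name__ == "__main__":
    # 99.5% availability leaves roughly 43.8 hours of downtime per year.
    print(round(downtime_hours_per_year(0.995), 1))
```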
  • 10. Detection of fault
      • Two techniques
          – Heartbeat – component sends a periodic message indicating that it is alive
          – Timeout – client of component sets a deadline after which
              • Component will be assumed to have failed.
              • Messages will be assumed to have been lost.
      • Netflix (US video streaming service) advocates fast failure.
          – Clients set short timeout.
          – Results in better response time if component failed
          – May result in “false positive” whereby component is assumed to have failed but, in reality, is still alive.
          – If client retries request, it may be executed twice.
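The heartbeat and timeout techniques can be combined in a small monitor. The sketch below is a minimal illustration, not taken from the talk: `HeartbeatMonitor`, its method names, and the injectable `clock` parameter are all hypothetical.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each component (hypothetical helper)."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s  # deadline after which a component is presumed failed
        self.clock = clock          # injectable so the behaviour can be tested without sleeping
        self.last_seen = {}         # component name -> time of last heartbeat

    def beat(self, component):
        """Record a periodic "I am alive" message from a component."""
        self.last_seen[component] = self.clock()

    def presumed_failed(self, component):
        """Return True if no heartbeat arrived within the timeout.

        This may be a false positive: the component can still be alive
        but slow, or its heartbeat message may have been lost.
        """
        seen = self.last_seen.get(component)
        return seen is None or self.clock() - seen > self.timeout_s
```

A short `timeout_s` gives the fast-failure behaviour described on the slide, at the price of more false positives.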
  • 11. Recovery from fault
      • Redundancy of computation and data
          – Redundancy of data will be discussed in next section on consistency
          – Redundancy of computation is typically achieved by making services stateless.
              • Can send failed messages to new instance. Need to be concerned about second execution if first message was, in fact, acted on
              • Can instantiate new copy of service if failure is caused by overloading.
      • Alternative means for accomplishing service
          – Some services can be accomplished using different mechanisms. Consider one mechanism as a fallback to a primary.
          – Degraded service might be possible.
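One common way to address the "second execution" concern is to tag each request with an identifier and ignore repeats. The sketch below assumes that approach; the talk does not prescribe it, and the class and method names are hypothetical. For a truly stateless service the table of seen requests would live in a shared store rather than in memory.

```python
class IdempotentCounter:
    """A toy service whose `add` operation is safe to retry."""

    def __init__(self):
        self.total = 0
        self._results = {}  # request_id -> result already returned for that request

    def add(self, request_id, amount):
        """Apply `amount` once per request_id; a retry returns the stored result."""
        if request_id in self._results:
            return self._results[request_id]  # retry of a request that was acted on
        self.total += amount
        self._results[request_id] = self.total
        return self.total
```

If a client times out and resends request `"r1"`, the total is incremented only once.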
  • 12. Undo
      • After performing an operation in AWS, may want to go back to original state – i.e. Undo the operation
      • Not always that straight-forward:
          – Attaching volume is no problem while the instance is running, detaching might be problematic
          – Creating / changing auto-scaling rules has effect on number of running instances
              • Cannot terminate additional instances, as the rule would create new ones!
          – Deleted / terminated / released resources are gone!
  • 13. Undo using transaction approach
      [Diagram labels: Administrator; begin-transaction, do, do, do, rollback; + commit; + pseudo-delete]
  • 14. Approach
      [Diagram labels: Administrator; begin-transaction, do, do, do, rollback; Undo System; Sense cloud resources states (shown twice)]
  • 15. Approach
      [Diagram labels: Administrator; begin-transaction, do, do, do, rollback; Undo System; Sense cloud resources states (shown twice); Initial state; Goal state]
  • 16. Approach
      [Diagram labels: Administrator; begin-transaction, do, do, do, rollback; Undo System; Sense cloud resources states (shown twice); Initial state; Goal state; Plan, Generate code, Execute; Set of actions]
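The diagram labels suggest that the undo system senses resource states, compares an initial state with a goal state, and plans a set of actions. The sketch below illustrates only the planning step as a simple state diff. It is an assumption about how such a planner could look, not the NICTA implementation; the resource model (a mapping from resource id to state) and the action names are hypothetical.

```python
def plan_undo(initial, current):
    """Return actions that take `current` back to `initial`.

    Both arguments map a resource id to its sensed state.
    """
    actions = []
    # Resources created during the transaction must be removed.
    for rid in sorted(current.keys() - initial.keys()):
        actions.append(("remove", rid))
    # Resources that disappeared must be brought back. This is only possible
    # if the deletion was a pseudo-delete, since truly deleted resources are gone.
    for rid in sorted(initial.keys() - current.keys()):
        actions.append(("recreate", rid, initial[rid]))
    # Resources whose state changed must be restored.
    for rid in sorted(initial.keys() & current.keys()):
        if initial[rid] != current[rid]:
            actions.append(("restore", rid, initial[rid]))
    return actions
```

A real planner would also need to order the actions, for example changing an auto-scaling rule before terminating the instances it created.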
  • 17. Report fault
      • Through logs.
          – Correlating logs can be difficult
          – Tracking logs to root causes can be very difficult.
      • Through reporting to parent service.
          – It, in turn, may have alternative means of achieving its goals, including undo.
  • 18. Consistency issues
      • Data is frequently replicated.
          – NoSQL databases all replicate data
      • Replication takes time.
          – Means that inconsistent versions of data may exist
              • One (or more) that has been updated
              • One (or more) that has not yet received the updates.
          – Leads to phenomenon known as “eventual consistency”
          – May take ½ second to become consistent.
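The effect of replication delay can be illustrated with a toy store that has one primary copy and one lagging replica. This is a simulation with a caller-supplied logical clock, not a model of any particular database; the class name and the `lag` parameter are hypothetical.

```python
class ReplicatedStore:
    """One primary and one replica that receives each write `lag` time units later."""

    def __init__(self, lag):
        self.lag = lag
        self.primary = {}
        self.replica = {}
        self.pending = []  # (visible_at, key, value) updates not yet applied to the replica

    def write(self, now, key, value):
        self.primary[key] = value
        self.pending.append((now + self.lag, key, value))

    def _replicate(self, now):
        """Apply every pending update whose delivery time has been reached."""
        due = [p for p in self.pending if p[0] <= now]
        self.pending = [p for p in self.pending if p[0] > now]
        for _, key, value in due:
            self.replica[key] = value

    def read(self, now, key, consistent=False):
        """A consistent read goes to the primary; an eventual read goes to the replica."""
        if consistent:
            return self.primary.get(key)
        self._replicate(now)
        return self.replica.get(key)
```

With `lag=500`, an eventual read 100 time units after a write still misses the update, while a consistent read at the same moment sees it.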
  • 19. Characterising Eventual Consistency in Amazon SimpleDB
      • The probability of reading updated data in SimpleDB in US West
          – An application reads data X (ms) after it has written data
      • SimpleDB has two read operations
          – Eventual Consistent Read
          – Consistent Read
      • This pattern is consistent regardless of the time of day
      [Chart series: Eventual Consistent Read, Consistent Read]
  • 20. Other types of inconsistency
      • Configuration parameters
          – All instances should have same settings in terms of security, locality, etc.
      • Synchronization locks. Locks shared across distributed instances may not be in a consistent state.
      • One mechanism is to have a consistency manager.
          – Complicated since centralized consistency manager may fail and distributed consistency managers must be coordinated.
          – ZooKeeper is an open source tool that manages consistency for distributed applications at a small cost in latency.
  • 21. Continuous deployment practices
      • Many organizations have developers deploy after changes are tested
          – Google
          – Amazon
          – LinkedIn
          – Netflix
      • Leads to following types of problems
          – Multiple simultaneous versions active
          – Errors occurring during installation
  • 22. Various Upgrade Strategies
      • How many at once?
          – One at a time (rolling upgrade)
          – Groups at a time (staged upgrade, e.g. canaries)
          – All at once (big flip)
      • What happens to old versions?
          – Replaced en masse
          – Maintained for some period for compatibility purposes
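The three "how many at once" strategies differ only in batch size, which the sketch below makes explicit. The function name and instance identifiers are hypothetical, and a real upgrade would also wait for health checks between batches.

```python
def upgrade_batches(instances, batch_size):
    """Split instances into the groups that are upgraded together.

    batch_size == 1               -> rolling upgrade
    1 < batch_size < len(...)     -> staged upgrade
    batch_size == len(instances)  -> big flip
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    instances = list(instances)
    return [instances[i:i + batch_size] for i in range(0, len(instances), batch_size)]
```

For four instances, `upgrade_batches(ids, 1)` yields four batches, `upgrade_batches(ids, 2)` yields two, and `upgrade_batches(ids, 4)` yields one.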
  • 23. Services Can be Bundled in Two Fashions
      • Tightly Coupled
          – Google
          – Facebook
      • Loosely Coupled
          – Amazon
          – LinkedIn
  • 24. Tightly Coupled Services
      • Deployment unit is tier
      • A tier bundles multiple services into one virtual machine
      [Diagram labels: Tier 1, Tier 2]
  • 25. Loosely Coupled Services
      • Deep service dependency hierarchy – may be 70 deep
      • Upgrading one service in this hierarchy
      • Need to consider both service and its clients
      • Each service is a Virtual Machine
      [Figure from Netflix Tech Blog]
  • 26. Comparing Two Options
      • Both options provide for horizontal scaling based on load
      • Both options provide for failure recovery
          – Tightly coupled option will replace tier
          – Loosely coupled option will replace service
          – Failure recovery assumes stateless Virtual Machines
      • The options differ in
          – How updates and canaries are managed (I will discuss in a moment)
          – How unwanted dependencies are avoided
              • Tightly coupled option depends on developer discipline
              • Loosely coupled option avoids unwanted dependencies through information hiding.
  • 27. Common upgrade strategy
      • Require all versions to be backward compatible with previous versions
      • Require changes associated with new version to be software switchable.
      • Clients of a service must be version aware in order to know whether to utilize new functionality.
      • Once all instances have been upgraded to new versions, send signal to turn on changes both in the new version and their clients.
      • When using canaries only turn on changes for a subset of services and their clients.
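A "software switchable" change is often implemented as a feature toggle: the new code is deployed switched off and enabled later by a signal. The sketch below assumes that reading of the slide; the service, its `greet` operation, and the `formal` parameter are invented for illustration.

```python
class GreetingService:
    """New version of a service whose new behaviour ships switched off."""

    def __init__(self):
        self.new_feature_on = False  # deployed in the "off" position

    def enable_new_feature(self):
        """Called once the signal to turn on the change is received."""
        self.new_feature_on = True

    def greet(self, name, formal=False):
        # `formal` is the hypothetical new parameter. While the switch is off,
        # it is ignored, so the service behaves exactly like the old version.
        if self.new_feature_on and formal:
            return f"Good day, {name}."
        return f"Hello, {name}"
```

Old clients never pass `formal`, so they receive the old behaviour before and after the switch is flipped, which is the backward compatibility the strategy requires.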
  • 28. Current state of major internet provider
      • Each service has an owner
      • Every service instance is instrumented
      • When a canary is deployed, service owner examines monitoring data (next slide) and uses judgment to decide when to move to production.
      • Canary testing is currently based on functionality. No stress testing of canaries.
  • 29. Netflix Monitoring Sequence
      • Client outbound (start/end)
      • Network (start/end)
      • Service network (inbound start/end)
      • Service processing (start/end)
      • Service outbound (start/end)
      • Network (start/end)
      • Client inbound (start/end)
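Given start and end timestamps for each stage in the sequence, per-stage latency follows by subtraction. The sketch below shows one way a service owner might locate the slowest stage. The stage keys and data layout are assumptions; this is not Netflix's monitoring API.

```python
# Hypothetical stage names following the order of the monitoring sequence.
STAGES = [
    "client outbound",
    "network (request)",
    "service network inbound",
    "service processing",
    "service outbound",
    "network (response)",
    "client inbound",
]


def stage_durations(timestamps):
    """Map each recorded stage to its duration.

    `timestamps` maps a stage name to a (start, end) pair in any consistent unit.
    Stages without a recorded pair are skipped.
    """
    return {
        stage: timestamps[stage][1] - timestamps[stage][0]
        for stage in STAGES
        if stage in timestamps
    }


def slowest_stage(timestamps):
    """Return the name of the stage with the largest duration."""
    durations = stage_durations(timestamps)
    return max(durations, key=durations.get)
```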
  • 30. General picture for version aware loosely coupled services
      [Diagram labels: Client; Top level load balancer; Second level load balancer (two); Server for Version A; Server for Version B]
      Client
      • Version aware
          – Must know about new versions in order to take advantage of new functionality
          – May be implicitly version aware based on, e.g. cluster
      • Version unaware clients will only use old functionality and these can be served by any server since services are backward compatible.
      In addition:
      • Load variation may trigger elasticity rules.
      • Deciding whether to load new version or old version raises other issues.
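The routing rule on this slide can be sketched as a load balancer that filters servers by version only when the client asks for one. The class below is a simplified single-level illustration with round-robin selection; the names and the selection policy are assumptions.

```python
import itertools


class VersionAwareBalancer:
    """Routes version-aware clients to a matching server, others to any server."""

    def __init__(self, servers):
        self.servers = list(servers)       # list of (server_name, version) pairs
        self._counter = itertools.count()  # drives simple round-robin selection

    def route(self, wanted_version=None):
        if wanted_version is None:
            # Version-unaware client: any server will do because
            # all versions are backward compatible.
            candidates = self.servers
        else:
            candidates = [s for s in self.servers if s[1] == wanted_version]
        if not candidates:
            raise LookupError(f"no server for version {wanted_version!r}")
        name, _version = candidates[next(self._counter) % len(candidates)]
        return name
```

With servers `a1` and `a2` on version A and `b1` on version B, a client that asks for version B always reaches `b1`, while a version-unaware client may be sent to any of the three.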
  • 31. Canary Issues
      • Canaries are a form of live testing. Put a new version into limited production to test its correctness.
      • Issues
          – How long are new versions tested to determine correctness?
              • Period based – for some period of time
              • Load based – under some utilization assumptions
              • Result based – until some criterion is met
          – How are clients of new version chosen and how is this choice enforced?
          – How are the canaries deployed?
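The period-, load-, and result-based criteria could be encoded as an explicit promotion check. The slide lists them as alternatives; requiring all three together, as the sketch below does, is an assumption made for illustration, as are the parameter names and thresholds.

```python
def canary_ready(elapsed_s, requests_served, error_rate, *,
                 min_period_s, min_requests, max_error_rate):
    """Decide whether a canary has been observed long enough to promote.

    Period based: it has run for at least `min_period_s` seconds.
    Load based:   it has served at least `min_requests` requests.
    Result based: its error rate does not exceed `max_error_rate`.
    """
    return (
        elapsed_s >= min_period_s
        and requests_served >= min_requests
        and error_rate <= max_error_rate
    )
```

A check like this would support, not replace, the judgment of the service owner.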
  • 32. Use of canaries with tightly coupled services
      • Version awareness does not need to extend to load balancers
          – Services and clients are bound into VM
          – Services and clients that are used to test new version are in single VM and have no need for version aware load balancers.
  • 33. More Detail on Upgrade Process
      • Canaries are deployed and allowed to run for a period without turning on new features.
      • This is to test backward compatibility.
      • Once canaries pass this test, then the new features are turned on.
  • 34. Installation Motivating Scenario
      • You change the operating environment for an application
          – Configuration change
          – Version change
          – Hardware change
      • Result is degraded performance
      • When the software stack is deep with portions from different suppliers, the result is frequently:
  • 35. Why is Installation Error Prone?
      • Installation is complicated.
          – Installation guides for SAS 9.3 Intelligence, IBM i, Oracle 11g for Linux are ~250 pages each
          – Apache description of addresses and ports (one out of 16 descriptions) has following elements:
              • Choosing and specifying ports for the server to listen to
              • IPv4 and IPv6
              • Protocols
              • Virtual Hosts
          – The number of configuration options that must be set can be large
              • Hadoop has 206 options
              • HBase has 64
          – Many dependencies are not visible until execution
  • 36. Installation Processes
      • Processes may be
          – Undocumented
          – Out of date
          – Insufficiently detailed
      • Our goal is to build a process model including error recovery mechanisms
  • 37. Our Activities
      • Create up to date process models for installation processes. Information sources are
          – Process discovery from logs
          – Process formalization from existing written descriptions.
      • Process descriptions can be used to
          – Make trade offs
          – Make recommendations in real time to operations staff
          – Recommend setting checkpoints for potential later undo, before a risky part of a process is entered
          – Assist in the detection of errors
  • 38. Hard Problems
      • Creating accurate process models
          – Exception handling mechanisms are not well documented
          – Noisy logs
          – Our approach
              • Top down modeling using process modeling formalism
              • Bottom up process mining from error logs
      • Diagnosing errors
  • 39. Why is Error Diagnosis Hard?
      In a distributed computing environment, when an error occurs during operations, it is difficult and time consuming to diagnose it. Diagnosis involves correlating messages from
      • different distributed servers
      • different portions of the software stack
      and determining the root cause of the error. The root cause, in turn, may be within a portion of the stack that is different from where the error is observed.
  • 40. Test Bed
      Our current test bed is the HBase stack
  • 41. Currently Performing Analysis of Configuration Errors
      • Cross stack errors may take hours to diagnose
          – Log files are inconsistent
          – Error message may not give context necessary to determine root cause.
  • 42. Summary
      • The modern cloud environment and modern development practices have introduced new problems or made old problems more important.
      • Tactics exist to deal with some of these problems.
      • Developing tactics for other problems is a matter of research.
  • 43. NICTA Team
      • Anna Liu
      • Alan Fekete
      • Min Fu
      • Jim Zhanwen Li
      • Qinghua Lu
      • Sherif Sakr
      • Hiroshi Wada
      • Ingo Weber
      • Xiwei Xu
      • Liming Zhu