Scalability and Reliability in the Cloud

HIGH SCALABILITY AND
RELIABILITY IN THE
CLOUD
GREG THOMPSON
HEAD OF ARCHITECTURE, APPS ENABLEMENT
ALCATEL-LUCENT

@gmthomp greg.thompson@alcatel-lucent.com

About This Session
 Target audience is backend application
developers deploying infrastructure into a
cloud environment
 Will cover concepts for scalability and
reliability with the goal of helping application
developers understand some key
considerations when designing and building
the backend.

Design Time Decisions
 When first building your application backend,
consider a few important questions
 How fast should the application be recovered if a
failure occurs?
 How much downtime is acceptable?
 Is the application maintaining stateful data?
 What kind of information needs to be shared across
multiple instances?

What is Scalability?
 Scalability describes how well the application handles increasing traffic volume

Scalability – Factors to Consider
 Horizontal vs. Vertical
 Stateless vs. Stateful
 Understanding Limitations
 Connection Management
 Segmentation of traffic
 Segmentation of responsibility (distributed arch)
 Clustering
 Messaging

What Type of Scalability?
Vertical vs. Horizontal
Vertical
 Scaling up a single node
 Physical limitations – instances are very powerful but still have finite limits
 Resources such as number of sockets can only go so high
Horizontal
 Scaling out across multiple nodes
 Ability to distribute traffic over a number of nodes
 Allows for more flexibility over time
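As a rough illustration of scaling out, the sketch below estimates how many nodes a horizontally scaled deployment needs for a given peak load. The per-node capacity and load figures are made-up assumptions, not measurements from this session.

```python
import math

def nodes_needed(peak_rps: float, per_node_rps: float) -> int:
    """Nodes required to serve peak_rps when each node handles per_node_rps."""
    if per_node_rps <= 0:
        raise ValueError("per_node_rps must be positive")
    return max(1, math.ceil(peak_rps / per_node_rps))

# Hypothetical numbers: one node tops out at 2,000 requests per second.
print(nodes_needed(1_500, 2_000))  # 1: a single node is enough
print(nodes_needed(9_000, 2_000))  # 5: past one node's limit, scale out
```

In practice the per-node figure would come from load testing rather than a guess.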

Will the App Maintain State?
Stateless Applications
 Application does not persist information about transactions
 Each transaction is independent and atomic
[Diagram: Request → Application → Response]
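A minimal sketch of a stateless handler, assuming a hypothetical temperature-conversion request. The response depends only on the request, so any instance can serve it.

```python
def handle(request: dict) -> dict:
    """Stateless handler: the response depends only on the request itself."""
    celsius = float(request["celsius"])
    return {"fahrenheit": celsius * 9 / 5 + 32}

# Two "instances" of the same handler; neither remembers earlier requests.
instance_a = handle
instance_b = handle
print(instance_a({"celsius": 100}))  # {'fahrenheit': 212.0}
print(instance_b({"celsius": 100}))  # {'fahrenheit': 212.0}
```

Because nothing is remembered between calls, instances can be added or removed without migrating any data.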

Will the App Maintain State?
Stateful Applications
 Application needs to maintain data about transactions in progress
 Requires storage
 Persistence may also be required depending on the ...
[Diagram: First Request and Subsequent Request → Application → DB]
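One common way to keep a stateful application scalable is to hold the in-progress data in storage shared by all instances. The sketch below assumes a hypothetical shopping-cart session; `SessionStore` is an in-memory stand-in for a real database or cache.

```python
class SessionStore:
    """Stand-in for shared storage (a DB or cache); here just a dict."""
    def __init__(self):
        self._data = {}

    def load(self, session_id):
        return dict(self._data.get(session_id, {}))

    def save(self, session_id, state):
        self._data[session_id] = dict(state)

class AppInstance:
    """One application instance; all instances share the same store."""
    def __init__(self, name, store):
        self.name, self.store = name, store

    def add_item(self, session_id, item):
        state = self.store.load(session_id)
        cart = state.get("cart", []) + [item]
        self.store.save(session_id, {"cart": cart})
        return cart

store = SessionStore()
a, b = AppInstance("a", store), AppInstance("b", store)
a.add_item("s1", "book")         # first request lands on instance a
print(b.add_item("s1", "pen"))   # subsequent request on b: ['book', 'pen']
```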

Understanding Limitations
 Thorough testing is key to understanding bottlenecks
 Test real-world scenarios, including latency
 Push the system to the max to understand how it ...

Connection Management
Mobile Device Connections
 Mobile devices don’t always behave as you expect
 Connectivity is often very dynamic
 Devices move between 4G, 3G, 2G, no coverage, and Wi-Fi
 Not all TCP events are reported, so sockets can remain open
 If not handled correctly, these factors can be a time bomb no matter how far you scale a component vertically
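One way to defuse this is an application-level idle timeout: record the last activity on each connection and periodically drop the ones that have gone silent. The sketch below is an illustration under assumptions; the 60-second timeout and the connection IDs are hypothetical, and a real server would also close the underlying socket.

```python
import time

class ConnectionTable:
    """Tracks last activity per connection and reaps the silent ones."""
    def __init__(self, idle_timeout_s: float):
        self.idle_timeout_s = idle_timeout_s
        self._last_seen = {}

    def touch(self, conn_id, now=None):
        """Record activity (data or heartbeat) on a connection."""
        self._last_seen[conn_id] = time.monotonic() if now is None else now

    def reap(self, now=None):
        """Forget and return connections idle longer than the timeout."""
        now = time.monotonic() if now is None else now
        stale = [c for c, t in self._last_seen.items()
                 if now - t > self.idle_timeout_s]
        for c in stale:
            del self._last_seen[c]  # a real server would close the socket here
        return stale

table = ConnectionTable(idle_timeout_s=60)
table.touch("phone-1", now=0)    # last heard from at t=0
table.touch("phone-2", now=50)   # last heard from at t=50
print(table.reap(now=90))        # ['phone-1']
```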

Segmenting Traffic
 Once the application can be scaled out, traffic can be segmented in different ways
 Location (e.g. east coast vs. west coast)
 Pre-assigned criteria - User ID, IP, or other dynamic criteria
 Load Balanced
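Two of these strategies can be sketched in a few lines: pinning a user to a node by hashing the user ID, and plain round-robin load balancing. The node names are hypothetical, and hashing with a modulo is only one possible assignment scheme.

```python
import hashlib
import itertools

NODES = ["node-east-1", "node-east-2", "node-west-1"]  # hypothetical node names

def node_for_user(user_id: str, nodes=NODES) -> str:
    """Pre-assigned criteria: the same user ID always maps to the same node."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]

# Load balanced: hand out nodes in round-robin order instead.
round_robin = itertools.cycle(NODES)

print(node_for_user("user-42") == node_for_user("user-42"))  # True
print([next(round_robin) for _ in range(4)])
# ['node-east-1', 'node-east-2', 'node-west-1', 'node-east-1']
```

Note that modulo hashing remaps most users when the node count changes; consistent hashing is a common alternative when that matters.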

Segmenting Responsibility
 Segmenting
responsibility allows for
a distributed
architecture
 Each component can be
scaled independently
 Allows for more flexibility
in scaling
 Adds more complexity
and potential messaging
overhead

Clustering
 Clustering is the concept of having a group of nodes working together to provide the same capability
 Nodes typically co-located
 Common data shared as needed across the cluster
 Communication may be needed between nodes
[Diagram: four App Nodes connected to Shared Data]

Messaging
 Once a clustered and/or distributed architecture is used, messaging will be needed between various components and/or nodes
 Types of Messaging
 JMS
 Open Source MQ packages
 Custom Designed
 Use of APIs
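The producer/consumer pattern behind most of these options can be sketched with Python's standard library. This is an in-process stand-in only: `queue.Queue` plays the role a message broker would play between components, and the message contents are hypothetical.

```python
import queue
import threading

def worker(inbox: queue.Queue, results: list) -> None:
    """Consumer component: process messages until a shutdown sentinel arrives."""
    while True:
        msg = inbox.get()
        if msg is None:  # sentinel: shut down
            break
        results.append(msg.upper())

inbox = queue.Queue()
results = []
consumer = threading.Thread(target=worker, args=(inbox, results))
consumer.start()

# Producer component: publish messages without waiting for the consumer.
for text in ["order placed", "order shipped"]:
    inbox.put(text)
inbox.put(None)
consumer.join()
print(results)  # ['ORDER PLACED', 'ORDER SHIPPED']
```

The producer never calls the consumer directly, which is what lets each side be scaled or replaced independently.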

Example of Scaled Architecture
[Diagram: Site 1 and Site 2, each with load balancers in front of web servers, Component 1 and Component 2 instances, and a database]

What is Reliability/Availability?
 Availability is typically
measured by the amount of
downtime your application
has in a given year
 Unplanned downtime and
planned downtime are both
considered
 Reliability is described by the
likelihood of failure based on
actual measurements
 We’ll focus more on
Availability
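To make the yearly-downtime measure concrete, the sketch below converts an availability percentage into allowed minutes of downtime per year. The percentages are example targets, not figures from this session.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years

def downtime_minutes_per_year(availability_pct: float) -> float:
    """Downtime budget per year for a given availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.0, 99.9, 99.99):
    print(f"{pct}% -> {downtime_minutes_per_year(pct):.1f} minutes/year")
```

For example, 99.9% leaves roughly 525.6 minutes (under 9 hours) per year, and that budget covers planned and unplanned downtime together.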

Reliability/Availability
Factors to Consider
 Cost vs. Need
 Problem detection
 Automation for recovery
 Active/standby, active/active, hot standby vs. cold
standby
 Local and Geo-redundancy
 Multi-zone, multi-cloud
 Test Until You Break the System

Reliability Requirements
Cost Considerations
 Number of instances
 Bandwidth requirements between sites
 Complexity of software
 Monitoring
Need
 User Experience
 Customer requirements
 Negative Publicity

Problem Detection
 Effective monitoring of
the application is key to
minimizing downtime
 Event reporting in the
software
 External monitoring –
test for successful
behavior
 Auto detection and
alerting to minimize cost
of operations personnel
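A minimal sketch of external monitoring with automatic alerting: run a check repeatedly and raise an alert only after several consecutive failures, so a single blip does not page anyone. The `check` callable and the threshold are assumptions; a real probe might issue an HTTP request and validate the response.

```python
from typing import Callable

def probe(check: Callable[[], bool], failures_before_alert: int, attempts: int) -> bool:
    """Return True (alert) once check fails failures_before_alert times in a row."""
    consecutive = 0
    for _ in range(attempts):
        try:
            ok = check()
        except Exception:
            ok = False  # an exception from the check counts as a failure
        consecutive = 0 if ok else consecutive + 1
        if consecutive >= failures_before_alert:
            return True
    return False

# Simulated check results: healthy once, then three failures in a row.
responses = iter([True, False, False, False])
print(probe(lambda: next(responses), failures_before_alert=3, attempts=4))  # True
```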

Automation for Recovery
 The more quickly a failed component recovers, the higher the reliability
 Automatic detection and automatic recovery
 Automated installation is key to minimizing setup time during recovery
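The effect of recovery time can be illustrated with the standard steady-state availability formula, MTBF / (MTBF + MTTR). The failure and repair times below are hypothetical values chosen only to show the trend.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Assumed figures: one failure every 30 days on average.
manual = availability(mtbf_hours=720, mttr_hours=4)       # operator paged, manual rebuild
automated = availability(mtbf_hours=720, mttr_hours=0.1)  # auto-detect and auto-restart
print(f"manual: {manual:.4%}  automated: {automated:.4%}")
```

With the failure rate unchanged, shortening recovery alone moves the result noticeably closer to 100%.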

Availability Models
 N = number of nodes required for normal processing
 N+1 = one additional node to provide redundancy in case of failure
 N+K = K nodes provide additional redundancy
[Diagram: node rows for each model: N N / N N +1 / N N K K]
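To see what the extra K nodes buy, the sketch below computes the probability that at least N of the N+K nodes are up. It assumes every node fails independently with the same availability, which real deployments only approximate; the 99% per-node figure is hypothetical.

```python
from math import comb

def service_availability(n: int, k: int, node_availability: float) -> float:
    """Probability that at least n of n+k independent nodes are up."""
    total = n + k
    p = node_availability
    return sum(comb(total, up) * p**up * (1 - p)**(total - up)
               for up in range(n, total + 1))

# Assumed: 4 nodes needed for normal processing, each 99% available.
for k in (0, 1, 2):
    print(f"N+{k}: {service_availability(4, k, 0.99):.6f}")
```

Under these assumptions, going from N to N+1 removes most of the risk; each further node adds less.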

Redundancy Models
 Active/Cold Standby
 Backup site is booted up when needed
 Active/Hot Standby
 Backup site is running and ready to take over
 Active/Active
 Both sites active and processing traffic
[Diagram: site pairs for each model: Active + Cold Standby, Active + Standby, Active + Active]
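The routing difference between these models can be sketched as a small function that decides which sites receive traffic. The site names and health flags are hypothetical, and the sketch does not model how long a cold standby takes to boot.

```python
def route(sites: dict, mode: str) -> list:
    """Return the sites that should receive traffic.

    sites maps site name -> healthy flag, listed in priority order.
    """
    healthy = [name for name, ok in sites.items() if ok]
    if mode == "active/active":
        return healthy      # every healthy site takes traffic
    if mode == "active/standby":
        return healthy[:1]  # only the highest-priority healthy site
    raise ValueError(f"unknown mode: {mode}")

print(route({"site1": True, "site2": True}, "active/standby"))   # ['site1']
print(route({"site1": False, "site2": True}, "active/standby"))  # ['site2']
print(route({"site1": True, "site2": True}, "active/active"))    # ['site1', 'site2']
```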

Local and Geo-Redundancy
 Local
 Backup instances are available within the same location
 Use of availability zones within a region is very similar
 Geographic
 Backup instances are available in another geographic location
 Typically in a separate region to account for events such as natural disasters

Availability to the Max
 Multi-Zone/Multi-Region
 Multi-zone typically provides instances running in different physical locations, but in the same region
 Multi-region provides different geographic regions of availability
 Multi-Cloud
 If your application requires the maximum possible availability
 Run in different cloud providers in different regions

Test Until You Break the System
 Push the system to
the max and observe
the breaking points
 Fix the problem,
repeat
 The best way to find problems and prevent unplanned downtime is to test thoroughly with a mindset to break the system
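The push-until-it-breaks loop can be sketched as a simple load ramp. Here `send_load` is a hypothetical callable that applies a load level and reports whether the system still behaved correctly; the fake system and its 750 requests-per-second limit are stand-ins for a real load generator and target.

```python
def find_breaking_point(send_load, start=100, step=100, max_load=10_000):
    """Raise load step by step; return the first level at which send_load fails."""
    load = start
    while load <= max_load:
        if not send_load(load):
            return load
        load += step
    return None  # no failure observed up to max_load

# Stand-in system under test that fails beyond 750 requests per second.
def fake_system(rps):
    return rps <= 750

print(find_breaking_point(fake_system))  # 800
```

After fixing whatever broke at that level, rerunning the ramp shows where the next limit sits.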

THANK YOU!
Greg Thompson
@gmthomps
greg.thompson@alcatel-lucent.com
