From AT&T Bootstrap Week: This session focuses on architecture and design concepts to ensure scalability and maximize reliability for server-based applications running in the cloud environment. The session will discuss techniques to consider for achieving scalability and reliability and tradeoffs to consider such as time vs. cost based on the needs for different types of applications.
1. HIGH SCALABILITY AND RELIABILITY IN THE CLOUD
GREG THOMPSON
HEAD OF ARCHITECTURE, APPS ENABLEMENT
ALCATEL-LUCENT
@gmthomp greg.thompson@alcatel-lucent.com
2. About This Session
Target audience is backend application
developers deploying infrastructure into a
cloud environment
Will cover concepts for scalability and
reliability with the goal of helping application
developers understand some key
considerations when designing and building
the backend.
3. Design Time Decisions
When first building your application backend,
consider a few important questions
How fast should the application be recovered if a
failure occurs?
What kind of down time is acceptable?
Is the application maintaining stateful data?
What kind of information needs to be shared across
multiple instances?
5. What is Scalability?
Scalability is a term
used to describe
how the application
will handle
increased loads of
traffic volume
6. Scalability – Factors to Consider
Horizontal vs. Vertical
Stateless vs. Stateful
Understanding Limitations
Connection Management
Segmentation of traffic
Segmentation of responsibility (distributed arch)
Clustering
Messaging
7. What Type of Scalability?
Vertical vs. Horizontal
Vertical
Scaling up a single node
Physical limitations – instances are very powerful but still have finite limits
Resources such as number of sockets can only go so high
Horizontal
Scaling out across multiple nodes
Ability to distribute traffic over a number of nodes
Allows for more flexibility over time
8. Will the App Maintain State?
Stateless Applications
Application does not persist information about transactions
Each transaction is independent and atomic
[Diagram: Request -> Application -> Response]
9. Will the App Maintain State?
Stateful Applications
Application needs to maintain data about transactions in progress
Requires storage
Persistence may also be required depending on the ...
[Diagram: First Request and Subsequent Request -> Application -> DB]
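The contrast between the stateless and stateful slides can be sketched in a few lines of Python. This is only an illustration under assumptions, not code from the talk: `handle_stateless`, `SessionStore`, and `handle_stateful` are hypothetical names, and the in-memory dictionary stands in for whatever shared storage a real deployment would use.

```python
from dataclasses import dataclass, field


def handle_stateless(request: dict) -> dict:
    """Everything needed to answer is carried in the request itself."""
    return {"total": sum(request["items"])}


@dataclass
class SessionStore:
    """Stand-in for shared storage (hypothetical); normally a database or cache."""
    _data: dict = field(default_factory=dict)

    def load(self, session_id: str) -> list:
        return list(self._data.get(session_id, []))

    def save(self, session_id: str, items: list) -> None:
        self._data[session_id] = list(items)


def handle_stateful(store: SessionStore, session_id: str, item: int) -> dict:
    """Each request builds on data saved by earlier requests in the session."""
    items = store.load(session_id)
    items.append(item)
    store.save(session_id, items)
    return {"total": sum(items)}


store = SessionStore()
first = handle_stateful(store, "user-1", 5)        # first request
subsequent = handle_stateful(store, "user-1", 7)   # subsequent request sees the 5
```

Because the state lives in the store rather than in the handler, any instance could serve the subsequent request, provided all instances share the same store.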
10. Understanding Limitations
Thorough testing is
key to understanding
bottlenecks
Test real-world
scenarios, including
latency
Push the system to
the max to
understand how it
11. Connection Management
Mobile Device Connections
Mobile devices don’t always
behave like you expect
Connectivity is often very
dynamic
Devices move between 4G, 3G, 2G,
no signal, and Wi-Fi
Not all TCP events will get
reported and sockets can remain
open
If not handled correctly, these
factors can be a time bomb no
matter how far you scale a
component vertically
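One common way to defuse this is to stop trusting TCP to report every disconnect and instead expire connections that have been silent for too long. The sketch below is an assumption about how that might look, not the speaker's implementation: `ConnectionRegistry` is a hypothetical helper, and the clock is injectable so the behavior can be checked without waiting.

```python
import time


class ConnectionRegistry:
    """Tracks when each connection was last heard from (hypothetical helper)."""

    def __init__(self, idle_timeout: float, clock=time.monotonic):
        self.idle_timeout = idle_timeout
        self.clock = clock
        self.last_seen = {}

    def touch(self, conn_id: str) -> None:
        """Call whenever data or a heartbeat arrives on the connection."""
        self.last_seen[conn_id] = self.clock()

    def reap(self) -> list:
        """Drop and return connections that have been silent for too long."""
        now = self.clock()
        stale = [c for c, t in self.last_seen.items()
                 if now - t > self.idle_timeout]
        for c in stale:
            del self.last_seen[c]  # a real server would also close the socket here
        return stale
```

Running `reap()` on a timer keeps half-open sockets from accumulating until the node runs out of resources.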
12. Segmenting Traffic
Once the application is
able to be scaled out,
traffic can be
segmented in different
ways
Location (e.g. east coast
vs. west coast)
Pre-assigned criteria -
User ID, IP, or other
dynamic criteria
Load Balanced
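Two of the segmentation options on this slide can be sketched as follows. The node names and function names are hypothetical, and hashing the user ID is just one possible pre-assigned criterion.

```python
import hashlib
from itertools import cycle

NODES = ["node-a", "node-b", "node-c"]  # hypothetical node names


def route_by_user(user_id: str, nodes=NODES) -> str:
    """Pre-assigned criteria: the same user always lands on the same node."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


def round_robin(nodes=NODES):
    """Load balanced: hand out nodes in rotation, ignoring who is asking."""
    return cycle(nodes)
```

A cryptographic digest is used instead of Python's built-in `hash()` because string hashes are randomized per process, which would send the same user to different nodes after a restart. Note that simple modulo routing reassigns most users whenever the node count changes.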
13. Segmenting Responsibility
Segmenting
responsibility allows for
a distributed
architecture
Each component can be
scaled independently
Allows for more flexibility
in scaling
Adds more complexity
and potential messaging
overhead
14. Clustering
Clustering is the concept of having a group of nodes working together to provide the same capability
Nodes typically co-located
Common data shared as needed across the cluster
Communication may be needed between nodes
[Diagram: four App Nodes connected to Shared Data]
15. Messaging
Once a clustered and/or distributed architecture is used, messaging will be needed between various components and/or nodes
Types of Messaging
JMS
Open Source MQ packages
Custom Designed
Use of APIs
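The producer/consumer pattern behind these messaging options can be sketched in-process with Python's standard `queue` module. This is only a stand-in: a real deployment would put a broker (JMS, an open source MQ package, or similar) between separate components, and the message shape and stop marker here are assumptions of the sketch.

```python
import queue
import threading

orders = queue.Queue()   # stands in for a broker queue or topic
processed = []


def worker() -> None:
    """Consumer component: pulls messages until it sees the stop marker."""
    while True:
        message = orders.get()
        if message is None:       # stop marker, a convention of this sketch
            break
        processed.append(message["order_id"])


consumer = threading.Thread(target=worker)
consumer.start()

# Producer component: publishes and moves on without waiting for the consumer.
for order_id in (1, 2, 3):
    orders.put({"order_id": order_id})
orders.put(None)
consumer.join()
```

The producer never calls the consumer directly, which is what lets each side be scaled or restarted independently.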
16. Example of Scaled Architecture
[Diagram: Site 1 and Site 2, each with Load Balancers in front of Web Servers, Component 1 and Component 2 instances, and a Database]
18. What is Reliability/Availability?
Availability is typically
measured by the amount of
downtime your application
has in a given year
Unplanned downtime and
planned downtime are both
considered
Reliability is described by the
likelihood of failure based on
actual measurements
We’ll focus more on
Availability
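The relationship between an availability target and yearly downtime is simple arithmetic, sketched below. The function names are hypothetical, and a 365-day year is assumed.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(target_percent: float) -> float:
    """Hours of downtime per year allowed by an availability target."""
    return HOURS_PER_YEAR * (1 - target_percent / 100)


def availability_percent(downtime_hours: float) -> float:
    """Availability achieved given measured downtime over one year."""
    return 100 * (1 - downtime_hours / HOURS_PER_YEAR)
```

For example, a 99.9% target allows about 8.76 hours of downtime per year, planned and unplanned combined, while 99.99% allows under an hour.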
19. Reliability/Availability
Factors to Consider
Cost vs. Need
Problem detection
Automation for recovery
Active/standby, active/active, hot standby vs. cold
standby
Local and Geo-redundancy
Multi-zone, multi-cloud
Test Until You Break the System
20. Reliability Requirements
Cost Considerations
Number of instances
Bandwidth requirements between sites
Complexity of software
Monitoring
Need
User Experience
Customer requirements
Negative Publicity
21. Problem Detection
Effective monitoring of
the application is key to
minimizing downtime
Event reporting in the
software
External monitoring –
test for successful
behavior
Auto detection and
alerting to minimize cost
of operations personnel
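External monitoring with automatic alerting might look like the sketch below. `HealthMonitor` is a hypothetical class; the probe and alert callables are supplied by the caller (for instance, a probe that performs a real request against the application), and the threshold of three consecutive failures is an arbitrary assumption.

```python
class HealthMonitor:
    """Raises an alert after several consecutive failed probes."""

    def __init__(self, probe, alert, failure_threshold: int = 3):
        self.probe = probe                  # callable returning True when healthy
        self.alert = alert                  # callable invoked once per outage
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def check(self) -> bool:
        try:
            healthy = bool(self.probe())
        except Exception:
            healthy = False                 # a probe that raises counts as a failure
        if healthy:
            self.consecutive_failures = 0
            return True
        self.consecutive_failures += 1
        if self.consecutive_failures == self.failure_threshold:
            self.alert(f"{self.failure_threshold} consecutive probe failures")
        return False
```

Requiring several failures in a row avoids paging operations staff for a single dropped probe, and alerting only when the threshold is first reached avoids repeated alerts for the same outage.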
22. Automation for Recovery
The faster a failed
component recovers, the
higher the reliability
Automatic detection
and automatic
recovery
Automated installation is
key to minimizing
setup time during
recovery
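A minimal sketch of automatic recovery, assuming the caller supplies a `start` callable (for example, one that runs an automated install script) and an `is_healthy` check. The function name, attempt limit, and backoff policy are all hypothetical.

```python
import time


def recover(start, is_healthy, max_attempts: int = 3,
            backoff_seconds: float = 1.0, sleep=time.sleep) -> int:
    """Restart a component until it reports healthy; return attempts used.

    `start` and `is_healthy` are callables supplied by the caller.
    Raises RuntimeError if the component never comes back.
    """
    for attempt in range(1, max_attempts + 1):
        start()                                  # e.g. re-run an install script
        if is_healthy():
            return attempt
        sleep(backoff_seconds * attempt)         # wait longer after each failure
    raise RuntimeError(f"component still down after {max_attempts} attempts")
```

Raising after the final attempt hands the problem to a person only when automation has run out of options.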
23. Availability Models
N = number of nodes required for normal processing
N+1 = one additional node to provide redundancy in case of failure
N+K = K nodes provide additional redundancy
[Diagram: N nodes alone, N nodes plus 1 spare, N nodes plus K spares]
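The N, N+1, and N+K models reduce to a capacity check, sketched below. The load and per-node capacity figures are made-up numbers for illustration, and the function names are hypothetical.

```python
import math


def nodes_required(peak_load: float, capacity_per_node: float) -> int:
    """N: the smallest node count that carries the peak load."""
    return math.ceil(peak_load / capacity_per_node)


def survives_failures(total_nodes: int, failed: int,
                      peak_load: float, capacity_per_node: float) -> bool:
    """True if the remaining nodes can still carry the peak load."""
    return (total_nodes - failed) * capacity_per_node >= peak_load


n = nodes_required(peak_load=4500, capacity_per_node=1000)   # N = 5
```

With these numbers, N+1 (six nodes) rides out one failure but not two, while N+2 (seven nodes) rides out two.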
24. Redundancy Models
Active/Cold Standby
Backup site is booted up when needed
Active/Hot Standby
Backup site is running and ready to take over
Active/Active
Both sites active and processing traffic
[Diagram: a pair of sites for each model]
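The routing decision behind each redundancy model can be sketched as follows. `Site`, `pick_active_standby`, and `pick_active_active` are hypothetical names, and health is a simple flag here rather than a real check.

```python
class Site:
    def __init__(self, name: str, running: bool = True):
        self.name = name
        self.running = running   # False models a cold standby that is not booted
        self.healthy = True

    def boot(self) -> None:
        self.running = True


def pick_active_standby(active: Site, standby: Site) -> Site:
    """Active/standby: use the standby only when the active site is down."""
    if active.running and active.healthy:
        return active
    if not standby.running:
        standby.boot()           # cold standby pays a boot delay here
    return standby


def pick_active_active(sites: list, request_number: int) -> Site:
    """Active/active: spread requests across every healthy site."""
    healthy = [s for s in sites if s.running and s.healthy]
    if not healthy:
        raise RuntimeError("no healthy site available")
    return healthy[request_number % len(healthy)]
```

The only difference between cold and hot standby in this sketch is whether `boot()` has to run during failover, which is where the recovery-time versus cost tradeoff shows up.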
25. Local and Geo-Redundancy
Local
Backup instances are available within the same location
Use of availability zones within a region is very similar
Geographic
Backup instances are available in another geographic location
Typically in a separate region to account for events such as natural disasters
26. Availability to the Max
Multi-Zone/Multi-Region
Multi-zone typically provides instances running in different physical locations, but in the same region
Multi-region provides different geographic regions of availability
Multi-Cloud
If your application requires the maximum possible availability
Run in different cloud providers in different regions
27. Test Until You Break the System
Push the system to
the max and observe
the breaking points
Fix the problem,
repeat
The best way to find
problems to prevent
unplanned downtime
is to thoroughly test
with a mindset to
break
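A step-load test that looks for the breaking point might be sketched like this. The function names, step sizes, and 1% error threshold are assumptions; `fake_system` simulates a system under test so the sketch is self-contained, whereas a real run would drive actual traffic.

```python
def find_breaking_point(send_load, start: int = 100, step: int = 100,
                        limit: int = 10_000, max_error_rate: float = 0.01):
    """Raise the load step by step; return the first load level that fails.

    `send_load(n)` is supplied by the caller: it drives n requests per second
    at the system under test and returns the observed error rate (0.0 to 1.0).
    Returns None if the system holds up all the way to `limit`.
    """
    for load in range(start, limit + 1, step):
        if send_load(load) > max_error_rate:
            return load
    return None


def fake_system(load: int) -> float:
    """Simulated system for this sketch: errors appear above 750 req/s."""
    return 0.0 if load <= 750 else 0.2
```

After each fix, the same ramp can be rerun to confirm that the breaking point has moved.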