+
Dynamic Fault Tolerant
Applications using AWS
Sumit Kadyan
University Of Victoria
+
Agenda
 Motivation
 How do we design FT web services on AWS
 Research in Load Balancing Algorithms
 Future Study
 Questions!!
+
Motivation
 Not everything on the cloud is fault tolerant!!
 You have to design it to be fault tolerant.
 AWS offers dynamic fault tolerance.
 Around 40% of AWS users do not deploy any redundancy in
their setup.
 The price of cloud resources has fallen roughly 25-fold
(often quoted as 2500%) in 7 years.
 The AWS service warranty claims 99.95% availability. That's around 4
hours of downtime in a year.
+
Inherent Fault Tolerant Components
 Amazon Simple Storage Service (S3)
 Amazon Elastic Load Balancing (ELB)
 Amazon Elastic Compute Cloud (EC2)
 Amazon Elastic Block Store (EBS)
"The above inherently fault-tolerant components provide features
such as AZs, Elastic IPs, and snapshots that a fault-tolerant HA
system must take advantage of and use correctly."
Simply put, AWS has given you the resources to build HA/FT
applications.
+
AWS Components
 Amazon EC2 (Amazon Elastic Compute
Cloud): a web service that provides
computing resources, i.e. server
instances, to host your software.
 AMI (Amazon Machine Image): a
template containing the s/w & h/w
configuration applied to an instance type.
 EBS (Elastic Block Store): block-level
storage volumes for EC2 instances. Not
tied to an instance's lifetime. AFR (annual
failure rate) is around 0.1 to 0.5%.
+
Availability Zones
 Amazon AZs are isolated zones within the same region.
 Engineered to be insulated from failures of other AZs.
 Independent power, cooling, networking & security.
+
Elastic IP Addresses
 Public IP addresses that can be
mapped to any EC2 instance within
a particular EC2 region.
 Addresses are associated with the AWS
account, not the instance.
 If an EC2 component fails,
detach the Elastic IP from the failed
component and map it to a reserve
EC2 instance.
 Remapping downtime is around 1-2 minutes.
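The failover step above can be modeled as a small sketch. This is an illustrative simulation only, not the AWS API: the `EIPManager` class and instance ids are hypothetical, and a real remap would go through the EC2 associate-address operation.

```python
# Hypothetical sketch: models Elastic IP remapping during failover.
class EIPManager:
    """Tracks which instance an Elastic IP is currently mapped to."""

    def __init__(self, elastic_ip, primary, reserve):
        self.elastic_ip = elastic_ip
        self.mapped_to = primary   # active EC2 instance id
        self.reserve = reserve     # standby EC2 instance id

    def handle_failure(self, failed_instance):
        """Detach the EIP from the failed instance and remap to the reserve."""
        if failed_instance == self.mapped_to:
            self.mapped_to, self.reserve = self.reserve, None
        return self.mapped_to


mgr = EIPManager("203.0.113.10", primary="i-primary", reserve="i-reserve")
print(mgr.handle_failure("i-primary"))  # -> i-reserve
```

The real remap takes the 1-2 minutes of downtime noted above; the reserve instance must already be running for this to work.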
+
Auto Scaling
 Auto Scaling enables you to automatically scale EC2
capacity up or down.
 You define your own rules to achieve this, e.g. when the number of
running EC2s < X, launch Y EC2s.
 Use metrics from Amazon CloudWatch to launch/terminate
EC2s, e.g. resource utilization above a certain threshold.
 Example of AS & ELB next ->
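The two kinds of rules above can be sketched as plain functions. This is a minimal simulation, not the Auto Scaling API; the function names and the 70% threshold are illustrative assumptions.

```python
# Hypothetical sketch of the two Auto Scaling rule styles described above.

def desired_launches(running, min_instances, batch_size):
    """Floor rule: when running EC2s < X, launch Y EC2s."""
    return batch_size if running < min_instances else 0

def scale_on_utilization(running, cpu_percent, threshold=70, step=1):
    """CloudWatch-style rule: scale out when average CPU crosses a threshold."""
    return running + step if cpu_percent > threshold else running

print(desired_launches(running=1, min_instances=2, batch_size=1))  # -> 1
print(scale_on_utilization(running=2, cpu_percent=85))             # -> 3
```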
+
Elastic Load Balancing
 The Elastic Load Balancer distributes
incoming traffic across available EC2
instances.
 Monitors EC2 instances and removes failed
EC2 resources.
 Works in parallel with Auto Scaling to
provide FT.
+
Implement N+1 Redundancy with Auto
Scaling & ELB
 Let's say N=1.
 Define rule X: 2 instances of the defined AMI are always available.
 The ELB distributes load between the 2 servers, with enough capacity for
each server to handle the entire load on its own, i.e. N=1.
 Server 1 goes down.
 Server 2 can process the entire traffic.
 Auto Scaling identifies the failure and launches a healthy EC2 instance
from the AMI to fulfill rule X.
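The N+1 walkthrough above can be traced in a few lines. This is a toy simulation under hypothetical names (dicts stand in for instances), not real ELB/Auto Scaling behavior.

```python
# Toy simulation of the N+1 walkthrough: one server fails, traffic shifts
# to the survivor, and the auto-scaler relaunches from the AMI (rule X = 2).

servers = {"server-1": "healthy", "server-2": "healthy"}
RULE_X = 2  # instances of the AMI that must always be available

def healthy(pool):
    return [s for s, state in pool.items() if state == "healthy"]

# Server 1 goes down; the ELB now routes everything to server 2.
servers["server-1"] = "failed"
assert healthy(servers) == ["server-2"]

# Auto Scaling notices the shortfall and launches a replacement from the AMI.
while len(healthy(servers)) < RULE_X:
    servers[f"server-{len(servers) + 1}"] = "healthy"

print(sorted(healthy(servers)))  # -> ['server-2', 'server-3']
```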
+
Fault Tolerance Web Design
 Architecting High Availability in AWS
 High Availability in the Web/App Layer
 High Availability in the Load Balancing Layer
 High Availability in the Database Layer
+
Web/App Layer
 It is common practice to launch the Web/App layer on more
than one EC2 instance to avoid a SPOF.
 How is user session information shared between the
EC2 servers?
 It is hence necessary to synchronize session data among EC2
servers.
 Not every application can work with stateless server configurations.
+
Web/App Layer
+
Web/App Layer
 Option 1: JGroups
 Toolkit for reliable messaging.
 Can be used by Java-based servers.
 Suited for a maximum of around 5-10 EC2s.
 Not suited for larger architectures.
+
Web/App Layer
 Option 3: RDBMS
 Many use it, but it is considered poor design.
 The master will be overwhelmed by session
requests.
 An m1-class RDS MySQL master supports a
maximum of 600 connections. 400 online users
will generate session requests, leaving only 200
connections to serve transaction/user
authentication requests.
 Can cause intermittent web service
downtime for the above reason.
+
Web/App Layer
 Option 2: Memcached
 Widely used, supports multiple
platforms.
 Saves user session data on multiple
nodes to avoid a SPOF (trading off
latency to write to multiple nodes).
 Depending on requirements, create
high-memory EC2 instances for
Memcached/ElastiCache.
 Can scale up to tens of thousands of
requests.
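The multi-node session strategy above can be sketched with plain dicts standing in for cache nodes. This is a hedged illustration, not the Memcached protocol: node names, the replica count of 2, and the hashing scheme are all assumptions.

```python
# Sketch of replicated session storage; dicts stand in for Memcached nodes.
import hashlib

NODES = {"cache-a": {}, "cache-b": {}, "cache-c": {}}

def replica_nodes(session_id, copies=2):
    """Deterministically pick `copies` nodes for a session via hashing."""
    names = sorted(NODES)
    start = int(hashlib.md5(session_id.encode()).hexdigest(), 16) % len(names)
    return [names[(start + i) % len(names)] for i in range(copies)]

def put_session(session_id, data):
    # Write to every replica: extra latency, but no single point of failure.
    for node in replica_nodes(session_id):
        NODES[node][session_id] = data

def get_session(session_id):
    for node in replica_nodes(session_id):
        if session_id in NODES[node]:   # survives one node failure
            return NODES[node][session_id]
    return None

put_session("sess-42", {"user": "alice"})
NODES[replica_nodes("sess-42")[0]].clear()   # simulate a node going down
print(get_session("sess-42"))                # -> {'user': 'alice'}
```

The read path falls through to the second replica when the first node is empty, which is the SPOF-avoidance trade-off the slide describes.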
+
Load Balancing Layer
 Balances the load among the available EC2 instances.
 A SPOF in the LB can bring down the entire site during an outage.
 Just as important as replicating servers, databases, etc.
 There are many ways to build a highly available load balancing tier.
+
Load Balancing Tier
 Option 1: Elastic Load Balancer
 Inherently fault tolerant.
 Automatically distributes incoming traffic
among EC2 instances.
 Automatically creates more ELB
instances when load increases to avoid
a SPOF.
 Detects the health of EC2 instances and
routes only to healthy instances.
+
ELB Implementation Architecture
Single Server Setup
 Not recommended, yet the most
common!!
 What is there to balance!!??
 No fault tolerance benefit.
 SPOF in terms of both the LB & the
EC2 instance.
+
ELB Implementation Architecture
Multi-Server Setup (in an AZ)
 HTTP/S requests are directed to EC2
instances by the ELB.
 Multiple EC2 instances in the same AZ
under the ELB tier.
 The ELB load balances requests
between the Web/App EC2 instances.
+
ELB Implementation Architecture
ELB with Auto Scaling (inside an AZ)
 Web/App EC2 instances are configured
with Auto Scaling to scale out/in.
 Amazon ELB directs the load
seamlessly to the EC2 instances
configured with Auto Scaling.
+
ELB Implementation Architecture
Multiple AZs inside a Region
 Multiple Web/App EC2 instances can
reside across multiple AZs inside an
AWS region.
 The ELB performs multi-AZ load balancing.
+
ELB Implementation Architecture
ELB with Amazon Auto Scaling
across AZs
 EC2 instances can be configured with
Amazon Auto Scaling to scale
out/in across AZs.
 Highly recommended. Offers the highest
availability among all ELB
implementations.
+
Issues with ELB
 Supports only round-robin & sticky-session algorithms;
weighted as of 2013.
 Designed to handle incremental traffic. A sudden flash of traffic can
lead to non-availability until scaling up occurs.
 The ELB needs to be "pre-warmed" to handle sudden traffic.
Currently not configurable from the AWS console.
 Known to be "non-round-robin" when requests are generated
from a single or specific range of IPs,
e.g. multiple requests from within a company operating on a
specific IP range.
+
3rd party Load Balancer
 3rd-party load balancers
 Nginx & HAProxy can work as load
balancers.
 Use your own scripts to scale up EC2s
& LBs.
 Auto Scaling works best with ELB.
+
Load Balancing Algorithms
 Random: sends connection requests to servers randomly (simple
but inefficient).
 Round Robin: passes each new connection request to the
next server in line, eventually distributing connections evenly.
 Weighted Round Robin: assigns weights to machines based on
capacity; the number of connections each machine receives depends
on its weight.
 More algorithms, such as Least Connections, Fastest, etc.
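The first three algorithms above can be sketched in a few lines of Python; the server names and weights are placeholders.

```python
# Minimal sketches of the load balancing algorithms described above.
import itertools
import random

servers = ["web-1", "web-2", "web-3"]

# Random: simple but can load servers unevenly.
def pick_random():
    return random.choice(servers)

# Round Robin: each new connection goes to the next server in line.
rr = itertools.cycle(servers)

# Weighted Round Robin: connections proportional to capacity weights.
weights = {"web-1": 3, "web-2": 1, "web-3": 1}
wrr = itertools.cycle([s for s, w in weights.items() for _ in range(w)])

print([next(rr) for _ in range(4)])  # -> ['web-1', 'web-2', 'web-3', 'web-1']
assigned = [next(wrr) for _ in range(5)]
print(assigned.count("web-1"))       # -> 3 (three-fifths of connections)
```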
+
Proposed Research
 A load balancing algorithm that adapts its strategy for
allocating web requests dynamically.
 Prober: gathers status info from web servers every 50 ms:
 CPU load on the server
 Server's response rate
 Number of requests served
 Allocator: based on the prober's updates, the allocator adjusts the
allocated weights.
 The proposed algorithm differs by considering local & global
information at each web server to choose the best server for
each request.
+
Real Time Server Stats Load
Balancing (RTSLB)
Deciding factors used in the algorithm:
 Weighted metric of cache hits on different servers.
 CPU load of the web server.
 Server response rate.
 Number of client requests being handled.
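One way the four deciding factors could be combined is a weighted score per server, with the allocator picking the highest-scoring one from each probe update. The weights and field names below are illustrative assumptions, not the exact RTSLB formulation.

```python
# Hedged sketch of an RTSLB-style allocator step: combine the four probe
# factors into one score and pick the best server. Weights are assumptions.

def rtslb_score(stats, w_cache=0.3, w_cpu=0.3, w_resp=0.2, w_reqs=0.2):
    """Higher is better: reward cache hits and responsiveness, penalize load."""
    return (w_cache * stats["cache_hit_rate"]
            + w_resp * stats["response_rate"]
            - w_cpu * stats["cpu_load"]
            - w_reqs * stats["active_requests"] / 100.0)

def choose_server(probe_data):
    """Allocator step: pick the best server from the latest probe update."""
    return max(probe_data, key=lambda s: rtslb_score(probe_data[s]))

probes = {
    "web-1": {"cache_hit_rate": 0.9, "cpu_load": 0.8,
              "response_rate": 0.5, "active_requests": 90},
    "web-2": {"cache_hit_rate": 0.6, "cpu_load": 0.2,
              "response_rate": 0.9, "active_requests": 10},
}
print(choose_server(probes))  # -> web-2 (lightly loaded and responsive)
```

In the actual scheme the prober would refresh `probes` every 50 ms and the allocator would update weights rather than pick a single winner, but the scoring idea is the same.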
+
Architecture
+
Algorithm
+
Results
RTSLB outperforms the other load-based algorithms. The difference would
be much greater as the number of connections increases.
+
Future Study
 Neural Networks based LB algorithms have a promising future.
 Increasing availability by further improving existing LB
Algorithms.
 Studying the results in a cloud environment.
+
Questions