+
Dynamic Fault Tolerant
Applications using AWS
Sumit Kadyan
University Of Victoria
+
Agenda
 Motivation
 How do we design FT web services on AWS
 Research in Load Balancing Algorithms
 Future Study
 Questions!!
+
Motivation
 Not everything on the cloud is fault tolerant!!
 You have to design it to be fault tolerant.
 AWS offers dynamic fault tolerance.
 Around 40% of AWS users do not deploy any redundancy in
their setup.
 The price of cloud resources has fallen roughly 25-fold
(often quoted as 2500%) in 7 years.
 The AWS service warranty claims 99.95% availability. That's around 4
hours of downtime in a year.
+
Inherent Fault Tolerant Components
 Amazon Simple Storage Service (S3)
 Amazon Elastic Load Balancing (ELB)
 Amazon Elastic Compute Cloud (EC2)
 Amazon Elastic Block Store (EBS)
"The above inherently fault-tolerant components provide features
such as AZs, Elastic IPs, and snapshots that a fault-tolerant HA
system must take advantage of and use correctly."
Simply put, AWS has given you the resources to build HA/FT
applications.
+
AWS Components
 Amazon EC2 (Amazon Elastic Compute
Cloud): a web service that provides
computing resources, i.e. server
instances, to host your software.
 AMI (Amazon Machine Image): a
template containing the s/w & h/w
configuration applied to an instance type.
 EBS (Elastic Block Store): block-level
storage volumes for EC2 instances. Not
tied to an instance's lifetime. AFR (annual
failure rate) is around 0.1 to 0.5%.
+
Availability Zones
 Amazon AZs are isolated zones within the same region.
 Engineered to be insulated from failures of other AZs.
 Independent power, cooling, networking & security.
+
Elastic IP Addresses
 Public IP addresses that can be
mapped to any EC2 instance within
a particular EC2 region.
 Addresses are associated with the AWS
account, not the instance.
 If an EC2 component fails,
detach the Elastic IP from the failed
component and map it to a reserve
EC2 instance.
 Remapping downtime is around 1-2 minutes.
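The failover step above can be modeled as a small sketch. This is an illustrative simulation only, not the AWS API: the `EIPManager` class and instance ids are hypothetical, and a real remap would go through the EC2 associate-address operation.

```python
# Hypothetical sketch: models Elastic IP remapping during failover.
class EIPManager:
    """Tracks which instance an Elastic IP is currently mapped to."""

    def __init__(self, elastic_ip, primary, reserve):
        self.elastic_ip = elastic_ip
        self.mapped_to = primary   # active EC2 instance id
        self.reserve = reserve     # standby EC2 instance id

    def handle_failure(self, failed_instance):
        """Detach the EIP from the failed instance and remap to the reserve."""
        if failed_instance == self.mapped_to:
            self.mapped_to, self.reserve = self.reserve, None
        return self.mapped_to


mgr = EIPManager("203.0.113.10", primary="i-primary", reserve="i-reserve")
print(mgr.handle_failure("i-primary"))  # -> i-reserve
```

The real remap takes the 1-2 minutes of downtime noted above; the reserve instance must already be running for this to work.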
+
Auto Scaling
 Auto Scaling enables you to automatically scale EC2
capacity up or down.
 You define your own rules to achieve this, e.g. when the number of
running EC2s < X, launch Y EC2s.
 Use metrics from Amazon CloudWatch to launch/terminate
EC2s, e.g. resource utilization above a certain threshold.
 Example of AS & ELB next ->
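The two kinds of rules above can be sketched as plain functions. This is a minimal simulation, not the Auto Scaling API; the function names and the 70% threshold are illustrative assumptions.

```python
# Hypothetical sketch of the two Auto Scaling rule styles described above.

def desired_launches(running, min_instances, batch_size):
    """Floor rule: when running EC2s < X, launch Y EC2s."""
    return batch_size if running < min_instances else 0

def scale_on_utilization(running, cpu_percent, threshold=70, step=1):
    """CloudWatch-style rule: scale out when average CPU crosses a threshold."""
    return running + step if cpu_percent > threshold else running

print(desired_launches(running=1, min_instances=2, batch_size=1))  # -> 1
print(scale_on_utilization(running=2, cpu_percent=85))             # -> 3
```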
+
Elastic Load Balancing
 The Elastic Load Balancer distributes
incoming traffic across available EC2
instances.
 Monitors EC2 instances and removes failed
EC2 resources.
 Works in parallel with Auto Scaling to
provide FT.
+
Implement N+1 Redundancy with Auto
Scaling & ELB
 Let's say N=1.
 Define rule X: 2 instances of the defined AMI are always available.
 The ELB distributes load between the 2 servers, with enough capacity for
each server to handle the entire load on its own, i.e. N=1.
 Server 1 goes down.
 Server 2 can process the entire traffic.
 Auto Scaling identifies the failure and launches a healthy EC2 instance
from the AMI to fulfill rule X.
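The N+1 walkthrough above can be traced in a few lines. This is a toy simulation under hypothetical names (dicts stand in for instances), not real ELB/Auto Scaling behavior.

```python
# Toy simulation of the N+1 walkthrough: one server fails, traffic shifts
# to the survivor, and the auto-scaler relaunches from the AMI (rule X = 2).

servers = {"server-1": "healthy", "server-2": "healthy"}
RULE_X = 2  # instances of the AMI that must always be available

def healthy(pool):
    return [s for s, state in pool.items() if state == "healthy"]

# Server 1 goes down; the ELB now routes everything to server 2.
servers["server-1"] = "failed"
assert healthy(servers) == ["server-2"]

# Auto Scaling notices the shortfall and launches a replacement from the AMI.
while len(healthy(servers)) < RULE_X:
    servers[f"server-{len(servers) + 1}"] = "healthy"

print(sorted(healthy(servers)))  # -> ['server-2', 'server-3']
```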
+
Fault Tolerance Web Design
 Architecting High Availability in AWS
 High Availability in the Web/App Layer
 High Availability in the Load Balancing Layer
 High Availability in the Database Layer
+
Web/App Layer
 It is common practice to launch the Web/App layer on more
than one EC2 instance to avoid a SPOF.
 How is user session information shared between the
EC2 servers?
 It is hence necessary to synchronize session data among EC2
servers.
 Not every application can work with stateless server configurations.
+
Web/App Layer
+
Web/App Layer
 Option 1: JGroups
 Toolkit for reliable messaging.
 Can be used by Java-based servers.
 Suited for a maximum of around 5-10 EC2s.
 Not suited for larger architectures.
+
Web/App Layer
 Option 3: RDBMS
 Many use it, but it is considered poor design.
 The master will be overwhelmed by session
requests.
 An m1-class RDS MySQL master supports a
maximum of 600 connections. 400 online users
will generate session requests, leaving only 200
connections to serve transaction/user
authentication requests.
 Can cause intermittent web service
downtime for the above reason.
+
Web/App Layer
 Option 2: Memcached
 Widely used, supports multiple
platforms.
 Saves user session data on multiple
nodes to avoid a SPOF (trading off
latency to write to multiple nodes).
 Depending on requirements, create
high-memory EC2 instances for
Memcached/ElastiCache.
 Can scale up to tens of thousands of
requests.
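The multi-node session strategy above can be sketched with plain dicts standing in for cache nodes. This is a hedged illustration, not the Memcached protocol: node names, the replica count of 2, and the hashing scheme are all assumptions.

```python
# Sketch of replicated session storage; dicts stand in for Memcached nodes.
import hashlib

NODES = {"cache-a": {}, "cache-b": {}, "cache-c": {}}

def replica_nodes(session_id, copies=2):
    """Deterministically pick `copies` nodes for a session via hashing."""
    names = sorted(NODES)
    start = int(hashlib.md5(session_id.encode()).hexdigest(), 16) % len(names)
    return [names[(start + i) % len(names)] for i in range(copies)]

def put_session(session_id, data):
    # Write to every replica: extra latency, but no single point of failure.
    for node in replica_nodes(session_id):
        NODES[node][session_id] = data

def get_session(session_id):
    for node in replica_nodes(session_id):
        if session_id in NODES[node]:   # survives one node failure
            return NODES[node][session_id]
    return None

put_session("sess-42", {"user": "alice"})
NODES[replica_nodes("sess-42")[0]].clear()   # simulate a node going down
print(get_session("sess-42"))                # -> {'user': 'alice'}
```

The read path falls through to the second replica when the first node is empty, which is the SPOF-avoidance trade-off the slide describes.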
+
Load Balancing Layer
 Balances the load among the available EC2 instances.
 A SPOF in the LB can bring down the entire site during an outage.
 Just as important as replicating servers, databases, etc.
 There are many ways to build a highly available load balancing tier.
+
Load Balancing Tier
 Option 1: Elastic Load Balancer
 Inherently fault tolerant.
 Automatically distributes incoming traffic
among EC2 instances.
 Automatically creates more ELB
instances when load increases to avoid
a SPOF.
 Detects the health of EC2 instances and
routes only to healthy instances.
+
ELB Implementation Architecture
Single Server Setup
 Not recommended, yet the most
common!!
 What is there to balance!!??
 No fault tolerance benefit.
 SPOF in terms of both the LB & the
EC2 instance.
+
ELB Implementation Architecture
Multi-Server Setup (in an AZ)
 HTTP/S requests are directed to EC2
instances by the ELB.
 Multiple EC2 instances in the same AZ
under the ELB tier.
 The ELB load balances requests
between the Web/App EC2 instances.
+
ELB Implementation Architecture
ELB with Auto Scaling (inside an AZ)
 Web/App EC2 instances are configured
with Auto Scaling to scale out/in.
 Amazon ELB directs the load
seamlessly to the EC2 instances
configured with Auto Scaling.
+
ELB Implementation Architecture
Multiple AZs inside a Region
 Multiple Web/App EC2 instances can
reside across multiple AZs inside an
AWS region.
 The ELB performs multi-AZ load balancing.
+
ELB Implementation Architecture
ELB with Amazon Auto Scaling
across AZs
 EC2 instances can be configured with
Amazon Auto Scaling to scale
out/in across AZs.
 Highly recommended. Offers the highest
availability among all ELB
implementations.
+
Issues with ELB
 Supports only round-robin & sticky-session algorithms;
weighted as of 2013.
 Designed to handle incremental traffic. A sudden flash of traffic can
lead to non-availability until scaling up occurs.
 The ELB needs to be "pre-warmed" to handle sudden traffic.
Currently not configurable from the AWS console.
 Known to be "non-round-robin" when requests are generated
from a single or specific range of IPs,
e.g. multiple requests from within a company operating on a
specific IP range.
+
3rd party Load Balancer
 3rd-party load balancers
 Nginx & HAProxy can work as load
balancers.
 Use your own scripts to scale up EC2s
& LBs.
 Auto Scaling works best with ELB.
+
Load Balancing Algorithms
 Random: sends connection requests to servers randomly (simple
but inefficient).
 Round Robin: passes each new connection request to the
next server in line, eventually distributing connections evenly.
 Weighted Round Robin: assigns weights to machines based on
capacity; the number of connections each machine receives depends
on its weight.
 More algorithms, such as Least Connections, Fastest, etc.
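The first three algorithms above can be sketched in a few lines of Python; the server names and weights are placeholders.

```python
# Minimal sketches of the load balancing algorithms described above.
import itertools
import random

servers = ["web-1", "web-2", "web-3"]

# Random: simple but can load servers unevenly.
def pick_random():
    return random.choice(servers)

# Round Robin: each new connection goes to the next server in line.
rr = itertools.cycle(servers)

# Weighted Round Robin: connections proportional to capacity weights.
weights = {"web-1": 3, "web-2": 1, "web-3": 1}
wrr = itertools.cycle([s for s, w in weights.items() for _ in range(w)])

print([next(rr) for _ in range(4)])  # -> ['web-1', 'web-2', 'web-3', 'web-1']
assigned = [next(wrr) for _ in range(5)]
print(assigned.count("web-1"))       # -> 3 (three-fifths of connections)
```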
+
Proposed Research
 A load balancing algorithm that adapts its strategy for
allocating web requests dynamically.
 Prober: gathers status info from web servers every 50 ms:
 CPU load on the server
 Server's response rate
 Number of requests served
 Allocator: based on the prober's updates, the allocator adjusts the
allocated weights.
 The proposed algorithm differs by considering local & global
information at each web server to choose the best server for
each request.
+
Real Time Server Stats Load
Balancing (RTSLB)
Deciding factors used in the algorithm:
 Weighted metric of cache hits on different servers.
 CPU load of the web server.
 Server response rate.
 Number of client requests being handled.
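One way the four deciding factors could be combined is a weighted score per server, with the allocator picking the highest-scoring one from each probe update. The weights and field names below are illustrative assumptions, not the exact RTSLB formulation.

```python
# Hedged sketch of an RTSLB-style allocator step: combine the four probe
# factors into one score and pick the best server. Weights are assumptions.

def rtslb_score(stats, w_cache=0.3, w_cpu=0.3, w_resp=0.2, w_reqs=0.2):
    """Higher is better: reward cache hits and responsiveness, penalize load."""
    return (w_cache * stats["cache_hit_rate"]
            + w_resp * stats["response_rate"]
            - w_cpu * stats["cpu_load"]
            - w_reqs * stats["active_requests"] / 100.0)

def choose_server(probe_data):
    """Allocator step: pick the best server from the latest probe update."""
    return max(probe_data, key=lambda s: rtslb_score(probe_data[s]))

probes = {
    "web-1": {"cache_hit_rate": 0.9, "cpu_load": 0.8,
              "response_rate": 0.5, "active_requests": 90},
    "web-2": {"cache_hit_rate": 0.6, "cpu_load": 0.2,
              "response_rate": 0.9, "active_requests": 10},
}
print(choose_server(probes))  # -> web-2 (lightly loaded and responsive)
```

In the actual scheme the prober would refresh `probes` every 50 ms and the allocator would update weights rather than pick a single winner, but the scoring idea is the same.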
+
Architecture
+
Algorithm
+
Results
RTSLB outperforms the other load-based algorithms. The difference would
be much greater as the number of connections increases.
+
Future Study
 Neural Networks based LB algorithms have a promising future.
 Increasing availability by further improving existing LB
Algorithms.
 Studying the results in a cloud environment.
+
Questions