The art of infrastructure elasticity

April 28th, 2012

Cloud Developer Conference 2012, Bangalore

Harish Ganesan
CTO and Co-Founder
8KMiles
Harish11g.AWS@gmail.com

Agenda
• Problem
• Challenges
• Requirements
• Solution Architecture
• Q&A

What is the problem scenario?

Big Sales Promotion every quarter by the Enterprise

• Massive online Concurrent Visitors

• Limited processing capacity of the Booking Engine (~3k requests/sec)

• Unhappy Visitors

• More Booking opportunities lost

Solution (Step 1):
• Create a Queuing App in front of the Booking engine
• Efficiently Queue the concurrent visitors

Solution (Step 2):
Moderate and move the visitors waiting in the Queuing app to the Booking engine

What are the Challenges?

Concurrency
• HTTP/AJAX/REST requests
• Total: 500+ Million requests in 6 hours
• Average: 23k+ requests/sec
• Peak: 80K+ requests/sec
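
The average on this slide follows directly from the total and the window, and a quick calculation also shows how far the peak sits above both the average and the Booking Engine limit quoted earlier. This is only a back-of-the-envelope check in Python; the variable names are illustrative, and the comparison with the Booking Engine is a rough motivation, since the peak figure counts requests to the Queuing App.

```python
# Figures quoted on the slides.
TOTAL_REQUESTS = 500_000_000      # "500+ Million requests"
WINDOW_HOURS = 6                  # "in 6 hours"
PEAK_RPS = 80_000                 # "Peak: 80K+ requests/sec"
BOOKING_ENGINE_RPS = 3_000        # "~3k requests/sec" from the problem slide

average_rps = TOTAL_REQUESTS / (WINDOW_HOURS * 3600)
peak_to_average = PEAK_RPS / average_rps
peak_to_engine = PEAK_RPS / BOOKING_ENGINE_RPS

print(f"average: {average_rps:,.0f} requests/sec")
print(f"peak is {peak_to_average:.1f}x the average")
print(f"peak is {peak_to_engine:.1f}x the Booking Engine capacity")
```

The computed average is about 23,148 requests/sec, which agrees with the "23k+" on the slide.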


Queue efficiency
• Allot unique Queue Numbers for visitors

• Allot Queue Numbers on a fair basis (as much as possible)

• Reduce the wait time in the Queue Number allotment process

• Reduce the overall Queue wait time for the visitor

Load Volatility
[Figure: compute utilization over a year. Peak utilization during promos; wasted capacity and complete underutilization of the infra at other times]

• Massive utilization and underutilization pattern

IP Whitelisting
[Figure: EC2 instances in the public cloud calling 3rd party services. The IP addresses of the source EC2 instances need to be whitelisted in the 3rd party services gateway]

• Booking engine needs EC2 IP Whitelisting for security
• Consecutive IP range needed
Variety of OS / Software
• RedHat OS for Load Balancer, NoSQL and Queue Layer

• Apache Tomcat for the Java Web/App Layer

• CentOS for Processing Programs

• MySQL for Result storage
• Hadoop for Analytics

What are the requirements from the enterprise?

Requirements
• Elastic Infrastructure
• Create the Infrastructure 2 hrs before the
promo
• Tear down infrastructure 2 hrs after the promo
• Elastically expand the infra during the promo

• Highly Scalable and Available

• Log Analytics
• Complete Infrastructure Automation

Solution Architecture


Option 1: Single Queue (Initial thought)
[Figure: concurrent visitors -> single Queuing Application -> Booking Engine]

Option 2: Parallel Queue (Recommended)
[Figure: concurrent visitors -> parallel Queuing Applications -> Booking Engine]

Request types
• Customer Visit is an HTTP request to the Queuing Application

• Current Visitor Queue position is an AJAX call every X seconds to the Queuing Application
• More Wait ~ More Calls
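
"More Wait ~ More Calls" can be made concrete with a small calculation: with a fixed polling interval, the number of position-check calls grows linearly with the time a visitor spends in the queue. The sketch below assumes a hypothetical interval of 10 seconds, since the slides leave X unspecified, and the function name is made up for illustration.

```python
def polls_during_wait(wait_seconds: float, interval_seconds: float) -> int:
    """Number of position-check calls one visitor makes while waiting."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return int(wait_seconds // interval_seconds)


# Hypothetical values: the slides do not state X or typical wait times.
for wait in (60, 300, 900):
    print(wait, "seconds waiting ->", polls_during_wait(wait, interval_seconds=10), "calls")
```

Multiplying the per-visitor count by the number of waiting visitors gives a rough feel for why polling traffic dominates the request volume during a promo.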


Solution Step 1: The Cloud?

• Amazon Web Services

• We had 4+ years of architecture experience in AWS

• It satisfied many customer requirements and challenges in this use case

Solution Step 2: R53/NW

[Figure: users resolve through Amazon Route 53 to EC2 instances inside an Amazon Virtual Private Cloud, with VPC Subnet 1 in Availability Zone 1 and VPC Subnet 2 in Availability Zone 2]

• Amazon VPC with Multi-AZ subnet configurations (HA)
• Amazon Route 53 for Managed DNS
• DNS RR algorithm at Route53

Solution Step 3: Load Balancing

[Figure: users resolve through Amazon Route 53 to HAProxy EC2 Instance 1 and HAProxy EC2 Instance 2 in VPC Subnet 1, Availability Zone 1. Each HAProxy runs on an M1.large with EBS volumes and an Elastic IP, and uses the Round Robin algorithm]

Solution Step 3: Load Balancing
• HAProxy vs Amazon ELB

• Custom programs to Auto Scale HAProxy

• HAProxy Elastic -> Attach / Detach from Route53

• HAProxy IP whitelisting in 3rd party Gateway

• 16 HAProxy Instances, 2 AZs, 2 Subnets

• RR Load Balancing algorithm
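
The attach/detach idea combined with round robin can be illustrated with a small in-memory model. This is a sketch only: the class below is a hypothetical stand-in for a DNS record set, it does not call Route 53 or HAProxy, and the IP addresses come from a documentation-only range.

```python
class RoundRobinPool:
    """In-memory stand-in for a DNS record set that rotates across members."""

    def __init__(self):
        self._members = []
        self._next = 0

    def attach(self, address: str) -> None:
        if address not in self._members:
            self._members.append(address)

    def detach(self, address: str) -> None:
        if address in self._members:
            self._members.remove(address)
            self._next = 0  # restart rotation after the membership change

    def resolve(self) -> str:
        if not self._members:
            raise LookupError("no load balancers attached")
        address = self._members[self._next % len(self._members)]
        self._next += 1
        return address


pool = RoundRobinPool()
for ip in ("203.0.113.10", "203.0.113.11"):   # documentation-range IPs, not real ones
    pool.attach(ip)
first_round = [pool.resolve() for _ in range(4)]

pool.attach("203.0.113.12")                   # scale out: a new HAProxy joins
pool.detach("203.0.113.10")                   # scale in: an HAProxy leaves
second_round = [pool.resolve() for _ in range(4)]
```

Requests alternate between the two original members in the first round, and between the two remaining members after the membership change.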

Solution Step 4: Web/App Servers

[Figure: users resolve through Amazon Route 53 to HAProxy EC2 Instance 1, which uses the Round Robin algorithm across Web/App 1, Web/App 2 and Web/App 3 in VPC Subnet 1, Availability Zone 1. Each Web/App EC2 instance is a C1.Xlarge with EBS volumes and an Elastic IP]

Solution Step 4: Web/App Servers
• 3 Web/App instances under every HAProxy

• C1.Xlarge Instance Type for Web/App Instances

• Custom programs to Auto Scale C1.Xlarge

• Automatic Attach / Detach from HAProxy

• Every Web/App Instance with EIP for IP whitelisting

• 48 Web/App EC2 Instances spread across 2 AZs

Solution Step 5: Queue Servers

[Figure: users resolve through Amazon Route 53 to HAProxy EC2 Instance 1 (Round Robin algorithm), which fronts Web/App 1, Web/App 2 and Web/App 3. These feed a RabbitMQ instance on an m1.large with EBS volumes, in VPC Subnet 1, Availability Zone 1]

Solution Step 5: Queue Servers
• RabbitMQ vs Amazon SQS

• FIFO/Concurrency/No Duplicate messages

• 1 RabbitMQ instance for queuing in every sector

• M1.large Instance Type

• 16 RabbitMQ Instances overall
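
The two queue properties named on this slide, FIFO ordering and no duplicate messages, can be shown with a tiny in-memory model. This is an illustrative sketch, not the RabbitMQ API and not a claim about how the broker was configured; the class and the visitor ids are hypothetical.

```python
from collections import deque


class DedupFifo:
    """FIFO queue that ignores a visitor id it has already accepted."""

    def __init__(self):
        self._queue = deque()
        self._seen = set()

    def publish(self, visitor_id: str) -> bool:
        if visitor_id in self._seen:
            return False          # duplicate, e.g. a page refresh
        self._seen.add(visitor_id)
        self._queue.append(visitor_id)
        return True

    def consume(self):
        """Return the oldest visitor id, or None when the queue is empty."""
        return self._queue.popleft() if self._queue else None


q = DedupFifo()
accepted = [q.publish(v) for v in ("v1", "v2", "v1", "v3")]
drained = [q.consume() for _ in range(4)]
```

The second "v1" is rejected, and the remaining visitors come out in arrival order.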

Solution Step 6: Processors/Redis

[Figure: single sector view showing Amazon Route 53, HAProxy (Round Robin algorithm), Web/App 1 to 3, RabbitMQ, Processors, Redis Master and Redis Slave, and the Booking Engine, connected in numbered steps 1 to 7]

Components of a Single Sector
1. One HAProxy
2. Three Web/App
3. One RabbitMQ
4. One BG Processor Node
5. Two Redis

Sector is not an AWS term; it is the 8KMiles term for logical EC2 instance groups for this use case.

Solution Step 6: Redis
• Redis vs Amazon DynamoDB

• Redis: NoSQL KV Data store

• Visitors are shown their Current Queue position every X seconds from Redis

• 1 Redis Master-Slave instance for every sector

• M1.large Instance Type for Redis

• 32 Redis Instances overall
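
One way a position lookup against a key-value store could work is sketched below. Plain Python dicts stand in for Redis, and the key layout (a queue number per visitor plus a "last admitted" counter) is an assumption made for illustration; the slides do not describe the actual data model.

```python
# Plain dicts stand in for the Redis key-value store (hypothetical key layout).
queue_numbers = {"v1": 101, "v2": 102, "v3": 103}   # visitor id -> allotted queue number
state = {"last_admitted": 101}                      # highest queue number already sent onward


def visitors_ahead(visitor_id: str) -> int:
    """Visitors still in front of this one; 0 means next in line or already admitted."""
    number = queue_numbers[visitor_id]
    return max(0, number - state["last_admitted"] - 1)


for visitor in ("v1", "v2", "v3"):
    print(visitor, visitors_ahead(visitor))
```

Each polling call then needs only two key reads, which is the kind of cheap lookup a key-value store suits.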

Solution Step 6: Processors
• BG Processors: Java Programs to

• RabbitMQ -> Redis: Allot Queue numbers to visitor requests and insert to Redis

• Redis -> Booking Engine: Moderate the movement of queued visitors from Redis to Booking Engine

• Process the Response Status / Booking Status / Inactive Visitors / Timeouts

• 2 BG Processor nodes per sector

• CPU intensive: C1.Xlarge Instance Type

• 32 BG Processor Instances overall
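
The two processor duties, allotting queue numbers and moderating the flow toward the Booking Engine, can be sketched as follows. The original programs were written in Java and are not shown in the slides; this Python version is a simplified illustration, and the class name and the per-tick release cap are hypothetical.

```python
from collections import deque
from itertools import count


class SectorProcessor:
    """Allots queue numbers and releases visitors in capped batches."""

    def __init__(self, release_per_tick: int):
        self.release_per_tick = release_per_tick
        self._numbers = count(1)      # unique, increasing queue numbers
        self.waiting = deque()        # (queue_number, visitor_id), oldest first

    def allot(self, visitor_id: str) -> int:
        number = next(self._numbers)
        self.waiting.append((number, visitor_id))
        return number

    def release(self):
        """Move at most release_per_tick visitors toward the Booking Engine."""
        batch = []
        while self.waiting and len(batch) < self.release_per_tick:
            batch.append(self.waiting.popleft())
        return batch


proc = SectorProcessor(release_per_tick=2)    # hypothetical cap
numbers = [proc.allot(v) for v in ("v1", "v2", "v3")]
first_batch = proc.release()
second_batch = proc.release()
```

Capping each release keeps the downstream request rate bounded regardless of how many visitors are waiting, which is the point of moderating against an engine limited to roughly 3k requests/sec.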

Overall Solution Architecture
[Figure: Amazon Route 53 in front of Sectors 1 to 16. Each sector stacks HAProxy, Web/App, RabbitMQ, Redis and BG Programs, all feeding the Booking Engine]

Sector is not an AWS term; it is the 8KMiles term for logical EC2 instance groups for this use case.
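
The per-layer totals quoted on the step slides can be cross-checked against the 16-sector layout. The per-sector counts below are taken from those slides (the BG Processor count uses the "2 per sector" figure from the Processors slide); the dictionary itself is just an illustrative tally.

```python
SECTORS = 16
PER_SECTOR = {            # per-sector counts taken from the step slides
    "HAProxy": 1,
    "Web/App": 3,
    "RabbitMQ": 1,
    "Redis": 2,           # one master plus one slave
    "BG Processor": 2,
}

totals = {role: n * SECTORS for role, n in PER_SECTOR.items()}
sector_instances = sum(totals.values())

for role, total in totals.items():
    print(f"{role}: {total}")
print("instances inside sectors:", sector_instances)
```

The tally reproduces the 16 / 48 / 16 / 32 / 32 figures from the slides and sums to 144 instances inside sectors, which is in line with the "150+ EC2 instances" on the Infrastructure slide once supporting nodes are counted.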

Scalability
[Figure: Amazon Route 53 in front of an Amazon Virtual Private Cloud. AZ-1 (VPC Subnet 1) holds Sector-1 and Sector-2; AZ-2 (VPC Subnet 2) holds Sector-3 and Sector-4]



Scalability
• New sectors containing the LB, Web, Queue, NoSQL and BG stack will be created automatically depending upon the load
• Same AZ or multi-AZ can be specified for the creation
• CloudWatch Custom parameters used
• Automated Java Programs were used for the sector creation
• No manual intervention needed
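
A load-driven sector decision might look like the sketch below. The slides do not state a per-sector capacity or the scaling thresholds, so the 5,000 requests/sec per sector used in the usage example is an assumption (roughly the 80K peak divided across 16 sectors), and the function names are hypothetical.

```python
import math


def sectors_needed(load_rps: float, per_sector_rps: float, min_sectors: int = 1) -> int:
    """Smallest sector count whose combined capacity covers the load."""
    if per_sector_rps <= 0:
        raise ValueError("per_sector_rps must be positive")
    return max(min_sectors, math.ceil(load_rps / per_sector_rps))


def sectors_to_create(current: int, load_rps: float, per_sector_rps: float) -> int:
    """How many new sectors to launch; never negative (scale-in is out of scope here)."""
    return max(0, sectors_needed(load_rps, per_sector_rps) - current)


# Hypothetical capacity of 5,000 requests/sec per sector.
print(sectors_needed(80_000, 5_000))        # peak load
print(sectors_to_create(5, 80_000, 5_000))  # growing from 5 sectors to cover the peak
```

In the deployment described here, the load signal would come from CloudWatch custom metrics and the creation step from the automated programs; both are outside this sketch.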

High Availability @ Instance level
[Figure: Amazon Route 53 in front of the instances, with AZ-2 shown]

High Availability @ Instance
• HA built @ Web/App, Redis and BG processor instances
• Any failed / non-responsive EC2 instances will be automatically detected and replaced by Java programs
• No manual intervention needed
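
The detect-and-replace loop can be sketched in a few lines. The real programs were written in Java and talked to EC2; here the health probe and the replacement launcher are passed in as plain functions, and the instance names are hypothetical.

```python
def replace_unhealthy(instances, is_healthy, launch_replacement):
    """Return a new instance list where every failed instance is swapped for a fresh one."""
    result = []
    for instance in instances:
        if is_healthy(instance):
            result.append(instance)
        else:
            result.append(launch_replacement(instance))
    return result


# Hypothetical fleet and probes, for illustration only.
fleet = ["web-1", "web-2", "redis-1"]
down = {"web-2"}
healed = replace_unhealthy(
    fleet,
    is_healthy=lambda name: name not in down,
    launch_replacement=lambda name: name + "-replacement",
)
print(healed)
```

Running such a check on a schedule is what removes the need for manual intervention.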


High Availability @ Sector level
[Figure: Amazon Route 53 in front of Sector-1 to Sector-6 spread across the AZs]

High Availability @ Sector level
• Any failed / non-responsive instances inside Sectors will be automatically detected and replaced by Java programs
• If Sector-3 fails, the other sectors will still be active and can take requests

High Availability @ AZ Level
[Figure: Amazon Route 53 in front of the AZs, with AZ-2 shown]

High Availability @ AZ level
• If the entire AZ-2 fails, the load will be balanced to instances in AZ-1
• Automated programs will create new sectors inside AZ-1 to handle the load
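
A simple failover plan could restore the pre-failure sector count in the surviving zone. That target is an assumption for this sketch, since the slides only say new sectors are created "to handle the load"; the even 8/8 split across the two AZs and the function name are also hypothetical.

```python
def plan_az_failover(sectors_by_az: dict, failed_az: str) -> dict:
    """Sectors to create in each surviving AZ to restore the pre-failure sector count."""
    survivors = [az for az in sectors_by_az if az != failed_az]
    if not survivors:
        raise RuntimeError("no surviving availability zone")
    lost = sectors_by_az.get(failed_az, 0)
    share, extra = divmod(lost, len(survivors))
    # Spread lost sectors evenly; the first `extra` survivors take one more.
    return {az: share + (1 if i < extra else 0) for i, az in enumerate(survivors)}


plan = plan_az_failover({"AZ-1": 8, "AZ-2": 8}, failed_az="AZ-2")
print(plan)
```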


Log Analytics
[Figure: (1) EC2 instances with logs -> (2) S3 bucket -> Elastic MapReduce jobs on an HDFS cluster -> (3) RDS MySQL]

• Redis, Web/App, HAProxy and RabbitMQ (RBQ) logs synced to S3

• Elastic MapReduce Jobs to process / analyze the logs

• Processed results moved to RDS MySQL for reports / visualizations
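
The map and reduce steps of such a log job can be imitated locally on a handful of lines. This is not Elastic MapReduce code: it runs in a single Python process, and the log format and field names are invented for illustration because the slides do not show the real logs.

```python
from collections import Counter

# Hypothetical access-log lines; the real log formats are not shown in the slides.
LOG_LINES = [
    "2012-01-01T10:00:01 sector=1 status=200",
    "2012-01-01T10:00:01 sector=2 status=200",
    "2012-01-01T10:00:02 sector=1 status=503",
]


def map_line(line: str):
    """Map step: emit a (sector, status) key for one log line."""
    fields = dict(part.split("=", 1) for part in line.split()[1:])
    return fields["sector"], fields["status"]


def reduce_counts(keys):
    """Reduce step: count occurrences of each key."""
    return Counter(keys)


report = reduce_counts(map_line(line) for line in LOG_LINES)
for (sector, status), hits in sorted(report.items()):
    print(f"sector {sector} status {status}: {hits}")
```

Aggregates of this shape are small enough to load into a relational table for reporting, which matches the move to RDS MySQL described above.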

Monitoring
• Nagios + Puppet (combined) for auto-scaled monitoring infra and deployment

• CloudWatch Custom metrics / Tomcat Valve / Automated Java Programs for EC2

Backup
• No backups -> only Syncs to S3

• Golden AMI snapshots to S3

• Periodic Sync of data between EC2 and S3

• Periodic log Sync from Web/App to S3

Infrastructure
• Amazon Route53
• Amazon VPC – Public, Private subnet
• 150+ EC2 instances, 2 AZs, 1 Region
• 70+ Elastic IPs
• 200+ EBS volumes
• S3 buckets
• Suite of monitoring tools
• 1 Puppet Server
• Amazon CloudWatch
• Amazon CloudFront

Infrastructure Elasticity
• Entire Infra created 2 hrs before promo
• Tear down infra 2 hrs after promo
• ~30 Mins to launch the infra in AWS
• ~45 Mins to tear down
• Automated Failure detection/rectification
• Automated Programs for Infra creation


Infrastructure Cost
• ~10K USD per promo
• Not inclusive of Data charges

• Unthinkable Savings
• Visitor experience was good
• More Bookings per Promo

Power of Elasticity is Simply priceless
AWS is “AWSome”

Need help architecting highly elastic solutions on AWS?

Leave it to the experts; we will handle this

Cloud Architecture Consulting
Cloud Application Development
Cloud Migration & Implementation
Cloud Adoption Strategy

“Let's get the job done”

Q&A
Harish11g.aws@gmail.com
http://in.linkedin.com/in/harishganesan
www.twitter.com/harish11g
http://harish11g.blogspot.com

Amazon Web Services
aws.amazon.com
aws.amazon.com/contact-us/aws-sales
