1. The Art of Infrastructure Elasticity
April 28th, 2012
Cloud Developer Conference 2012, Bangalore
Harish Ganesan
CTO and Co-Founder
8KMiles
Harish11g.AWS@gmail.com
11. Queue efficiency
• Allot unique queue numbers to visitors
• Allot queue numbers on a fair basis (as far as possible)
• Reduce the wait time in the queue number allotment process
• Reduce the overall queue wait time for the visitor
12. Load Volatility
Chart: compute utilization over a year. Peak utilization during promos; complete under-utilization of infra (wasted capacity) at other times.
• Massive utilization and under-utilization pattern
13. IP Whitelisting
Diagram: Public Cloud -> 3rd Party Services. The IP addresses of the source EC2 instances need to be whitelisted in the 3rd party services gateway.
• Booking engine needs EC2 IP whitelisting for security
• Consecutive IP range needed
14. Variety of OS / Software
• Red Hat OS for the Load Balancer, NoSQL and Queue layers
• Apache Tomcat for the Java Web/App layer
• CentOS for processing programs
• MySQL for result storage
• Hadoop for analytics
16. Requirements
• Elastic Infrastructure
• Create the Infrastructure 2 hrs before the
promo
• Tear down infrastructure 2 hrs after the promo
• Elastically expand the infra during the promo
• Highly Scalable and Available
• Log Analytics
• Complete Infrastructure Automation
20. Request types
• A customer visit is an HTTP request to the Queuing Application
• The current visitor queue position is an AJAX call to the Queuing Application every X seconds
• More Wait ~ More Calls
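The "More Wait ~ More Calls" relationship can be made concrete with a little arithmetic. A minimal sketch in Python, assuming one initial HTTP visit plus one AJAX poll per full interval waited; the function names and sample numbers are illustrative and do not come from the talk:

```python
def requests_per_visitor(wait_seconds: float, poll_interval_seconds: float) -> int:
    """One initial HTTP visit plus one AJAX position poll per full interval waited."""
    if poll_interval_seconds <= 0:
        raise ValueError("poll interval must be positive")
    polls = int(wait_seconds // poll_interval_seconds)
    return 1 + polls


def total_requests(visitors: int, wait_seconds: float, poll_interval_seconds: float) -> int:
    """Requests hitting the Queuing Application if every visitor waits equally long."""
    return visitors * requests_per_visitor(wait_seconds, poll_interval_seconds)
```

With X = 10 seconds, doubling the wait from 60 to 120 seconds raises the per-visitor request count from 7 to 13, so longer queues multiply load on the Queuing Application itself.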
21. Solution Step 1: The Cloud?
• Amazon Web Services
• We had 4+ years of architecture experience on AWS
• It satisfied many of the customer requirements and challenges in this use case
22. Solution Step 2: R53/NW
Diagram: Users -> Amazon Route 53 -> EC2 instances on AWS inside an Amazon Virtual Private Cloud, with VPC Subnet 1 in Availability Zone 1 and VPC Subnet 2 in Availability Zone 2.
• Amazon VPC with Multi-AZ subnet configurations (HA)
• Amazon Route 53 for managed DNS
• DNS RR algorithm at Route 53
23. Solution Step 3: Load Balancing
Diagram: Users -> Amazon Route 53 -> HAProxy EC2 Instance 1 and HAProxy EC2 Instance 2 inside the Amazon Virtual Private Cloud (VPC Subnet 1, Availability Zone 1). Each HAProxy instance is an M1.large with an Elastic IP and EBS volumes, and uses the Round Robin algorithm.
24. Solution Step 3: Load Balancing
• HAProxy vs Amazon ELB
• Custom programs to auto scale HAProxy
• Elastic HAProxy -> attach / detach from Route 53
• HAProxy IP whitelisting in the 3rd party gateway
• 16 HAProxy instances, 2 AZs, 2 subnets
• RR load balancing algorithm
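The RR (round robin) algorithm hands each new request to the next backend in a fixed rotation. A minimal sketch in Python of that selection logic only; it is not HAProxy's implementation, and the class and backend names are hypothetical:

```python
from itertools import cycle
from typing import Iterable, List


class RoundRobinBalancer:
    """Picks backends in a fixed rotation, one per request."""

    def __init__(self, backends: Iterable[str]) -> None:
        self._backends: List[str] = list(backends)
        if not self._backends:
            raise ValueError("at least one backend is required")
        self._rotation = cycle(self._backends)

    def next_backend(self) -> str:
        """Return the backend that should receive the next request."""
        return next(self._rotation)
```

For example, with three hypothetical backends `web-1`, `web-2` and `web-3`, five consecutive calls to `next_backend()` return `web-1`, `web-2`, `web-3`, `web-1`, `web-2`.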
26. Solution Step 4: Web/App Servers
• 3 Web/App instances under every HAProxy
• C1.Xlarge instance type for Web/App instances
• Custom programs to auto scale C1.Xlarge
• Automatic attach / detach from HAProxy
• Every Web/App instance has an EIP for IP whitelisting
• 48 Web/App EC2 instances spread across 2 AZs
28. Solution Step 5: Queue Servers
• RabbitMQ vs Amazon SQS
• FIFO / concurrency / no duplicate messages
• 1 RabbitMQ instance for queuing in every sector
• M1.large instance type
• 16 RabbitMQ instances overall
29. Solution Step 6: Processors/Redis
Diagram: Single Sector View. Amazon Route 53 -> HAProxy (Round Robin algorithm) -> Web/App 1, Web/App 2, Web/App 3 -> RabbitMQ -> Processors -> Redis Master / Redis Slave -> Processors -> Booking Engine.
Components of a single sector:
1. One HAProxy
2. Three Web/App
3. One RabbitMQ
4. One BG Processor Node
5. Two Redis
Sector is not an AWS term; it is the 8KMiles term for logical EC2 instance groups for this use case.
30. Solution Step 6: Redis
• Redis vs Amazon DynamoDB
• Redis: NoSQL KV data store
• Visitors are shown their current queue position from Redis every X seconds
• 1 Redis Master-Slave pair for every sector
• M1.large instance type for Redis
• 32 Redis instances overall
31. Solution Step 6: Processors
• BG Processors: Java programs that handle:
• RabbitMQ -> Redis: allot queue numbers to visitor requests and insert them into Redis
• Redis -> Booking Engine: moderate the movement of queued visitors from Redis to the Booking Engine
• Processing of the response status / booking status / inactive visitors / timeouts
• 2 BG Processor nodes per sector
• CPU intensive: C1.Xlarge instance type
• 32 BG Processor instances overall
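The two processor flows above can be sketched in a few lines. This is a simplified, in-memory illustration in Python and not the Java programs from the talk: a `deque` stands in for the RabbitMQ FIFO queue, a dictionary stands in for Redis, and all names are hypothetical.

```python
from collections import deque
from itertools import count
from typing import Deque, Dict, List


class QueueNumberAllotter:
    """Allots unique, increasing queue numbers in arrival order."""

    def __init__(self) -> None:
        self._next_number = count(1)
        # visitor_id -> queue number (stand-in for the Redis store)
        self.positions: Dict[str, int] = {}

    def drain(self, incoming: Deque[str]) -> None:
        """Queue -> store: number each visitor once, in FIFO order."""
        while incoming:
            visitor_id = incoming.popleft()
            if visitor_id in self.positions:
                continue  # duplicate request: keep the original number
            self.positions[visitor_id] = next(self._next_number)

    def release(self, batch_size: int) -> List[str]:
        """Store -> booking engine: release the lowest queue numbers first."""
        ordered = sorted(self.positions, key=self.positions.get)
        released = ordered[:batch_size]
        for visitor_id in released:
            del self.positions[visitor_id]
        return released
```

Draining in arrival order is what keeps the allotment fair, and `batch_size` is the knob that moderates how quickly queued visitors reach the Booking Engine.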
32. Overall Solution Architecture
Sector is not an AWS term; it is the 8KMiles term for logical EC2 instance groups for this use case.
Diagram: Amazon Route 53 in front of Sectors 1 to 16. Each sector stacks HAProxy, Web/App, RabbitMQ, Redis and BG Programs, and all sectors feed the Booking Engine.
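The per-sector composition and the overall instance counts quoted in the earlier slides are consistent with each other, which a short calculation confirms. A minimal sketch in Python; the dictionary and function names are illustrative, and the per-sector counts are the ones from the bullet lists (1 HAProxy, 3 Web/App, 1 RabbitMQ, 2 Redis, 2 BG Processors):

```python
from typing import Dict

# Instances of each role inside one sector, as listed in the bullet slides.
SECTOR_COMPOSITION: Dict[str, int] = {
    "HAProxy": 1,
    "Web/App": 3,
    "RabbitMQ": 1,
    "Redis": 2,
    "BG Processor": 2,
}


def fleet_size(sectors: int, composition: Dict[str, int] = SECTOR_COMPOSITION) -> Dict[str, int]:
    """Total instances per role for a given number of sectors."""
    if sectors < 0:
        raise ValueError("sector count cannot be negative")
    return {role: per_sector * sectors for role, per_sector in composition.items()}
```

For 16 sectors this gives 16 HAProxy, 48 Web/App, 16 RabbitMQ, 32 Redis and 32 BG Processor instances, 144 in total across the sectors.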
34. Scalability
• New sectors containing the LB, Web, Queue, NoSQL and BG stack are created automatically depending on the load
• Same AZ or multi-AZ can be specified for the creation
• CloudWatch custom parameters are used
• Automated Java programs are used for sector creation
• No manual intervention needed
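The scale-out decision can be thought of as a capacity calculation over sectors. A minimal sketch in Python, assuming each sector can absorb a fixed load; the load unit, the per-sector capacity and the function names are assumptions for illustration, and the real programs were written in Java and driven by CloudWatch custom metrics:

```python
import math


def sectors_needed(current_load: float, capacity_per_sector: float, min_sectors: int = 1) -> int:
    """Smallest number of sectors able to carry the current load."""
    if capacity_per_sector <= 0:
        raise ValueError("capacity per sector must be positive")
    return max(min_sectors, math.ceil(current_load / capacity_per_sector))


def sectors_to_create(current_load: float, capacity_per_sector: float, running_sectors: int) -> int:
    """How many new sectors to launch; never negative (this sketch only scales out)."""
    required = sectors_needed(current_load, capacity_per_sector)
    return max(0, required - running_sectors)
```

For example, a load of 16,000 units with a hypothetical capacity of 1,000 units per sector and 12 sectors running would call for 4 new sectors.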
35. High Availability @ Instance level
Diagram: Amazon Route 53 in front of an Amazon Virtual Private Cloud spanning AZ-1 and AZ-2. Groups of EC2 instances run in VPC Subnet 1 (Availability Zone 1) and VPC Subnet 2 (Availability Zone 2).
36. High Availability @ Instance
• HA built @ Web/App, Redis and BG Processor instances
• Any failed / non-responsive EC2 instance is automatically detected and replaced by Java programs
• No manual intervention needed
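The detect-and-replace loop can be outlined as follows. This is a sketch in Python under stated assumptions, not the Java programs used in the project: the health check and the launch step are passed in as plain callables so that no specific AWS API is implied, and the instance names are hypothetical.

```python
from typing import Callable, List, Tuple


def replace_unhealthy(
    instances: List[str],
    is_healthy: Callable[[str], bool],
    launch_replacement: Callable[[str], str],
) -> Tuple[List[str], List[str]]:
    """Return the repaired fleet and the list of instances that were replaced."""
    fleet: List[str] = []
    replaced: List[str] = []
    for instance_id in instances:
        if is_healthy(instance_id):
            fleet.append(instance_id)
        else:
            # Launch a substitute in the same slot so the fleet size is preserved.
            fleet.append(launch_replacement(instance_id))
            replaced.append(instance_id)
    return fleet, replaced
```

Running a pass like this on a schedule is one way to get replacement without manual intervention; the fleet keeps its size while unhealthy members are swapped out.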
38. High Availability @ Sector level
• Any failed / non-responsive instance inside a sector is automatically detected and replaced by Java programs
• If sector 3 fails, the other sectors remain active and can take requests
39. High Availability @ AZ Level
Diagram: Amazon Route 53 in front of an Amazon Virtual Private Cloud spanning AZ-1 and AZ-2. Groups of EC2 instances run in VPC Subnet 1 (Availability Zone 1) and VPC Subnet 2 (Availability Zone 2).
40. High Availability @ AZ level
• If the entire AZ-2 fails, load is balanced to the instances in AZ-1
• Automated programs create new sectors inside AZ-1 to handle the load
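The AZ failover plan amounts to recreating the lost sectors in the surviving zones. A minimal sketch in Python, assuming lost sectors are replaced one for one and spread evenly over the survivors; both the policy and the function name are assumptions for illustration:

```python
from typing import Dict


def plan_az_failover(sectors_by_az: Dict[str, int], failed_az: str) -> Dict[str, int]:
    """Number of new sectors to create in each surviving AZ."""
    survivors = sorted(az for az in sectors_by_az if az != failed_az)
    if not survivors:
        raise ValueError("no surviving availability zone")
    lost = sectors_by_az.get(failed_az, 0)
    base, extra = divmod(lost, len(survivors))
    # The first `extra` zones (in name order) take one additional sector.
    return {az: base + (1 if i < extra else 0) for i, az in enumerate(survivors)}
```

With 8 sectors in each of AZ-1 and AZ-2, a failure of AZ-2 yields a plan of 8 new sectors in AZ-1.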
41. Log Analytics
Diagram: (1) EC2 instances with logs -> (2) S3 bucket -> Elastic MapReduce jobs on an HDFS cluster -> (3) RDS MySQL.
• Redis, Web/App, HAProxy and RabbitMQ logs are synced to S3
• Elastic MapReduce jobs process / analyze the logs
• Processed results are moved to RDS MySQL for reports / visualizations
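The shape of such a log-processing job is a map step that emits key/value pairs and a reduce step that aggregates them. A local, single-process sketch in Python that illustrates the idea only; it does not use Elastic MapReduce, and the log line format (`<component> <message>`) is a hypothetical one chosen for the example:

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def map_log_line(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (component, 1) for each well-formed '<component> <message>' line."""
    parts = line.strip().split(maxsplit=1)
    if len(parts) == 2:
        yield parts[0], 1


def reduce_counts(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Reduce step: sum the values emitted for each key."""
    totals: Counter = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def analyze(lines: Iterable[str]) -> Dict[str, int]:
    """Run map then reduce over an iterable of log lines."""
    return reduce_counts(pair for line in lines for pair in map_log_line(line))
```

The resulting per-component totals are the kind of small, aggregated result that would then be loaded into a reporting database.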
42. Monitoring
• Nagios + Puppet (combined) for auto-scaled monitoring infra and deployment
• CloudWatch custom metrics / Tomcat Valve / automated Java programs for EC2
43. Backup
• No backups -> only syncs to S3
• Golden AMI snapshots to S3
• Periodic sync of data between EC2 and S3
• Periodic log sync from Web/App to S3
44. Infrastructure
• Amazon Route 53
• Amazon VPC: public and private subnets
• 150+ EC2 instances, 2 AZs, 1 Region
• 70+ Elastic IPs
• 200+ EBS volumes
• S3 buckets
• Suite of monitoring tools
• 1 Puppet server
• Amazon CloudWatch
• Amazon CloudFront
45. Infrastructure Elasticity
• Entire Infra created 2 hrs before promo
• Tear down infra 2 hrs after promo
• ~30 Mins to launch the infra in AWS
• ~45 Mins to tear down
• Automated Failure detection/rectification
• Automated Programs for Infra creation
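The timings above translate directly into a schedule for each promo. A minimal sketch in Python that derives the key timestamps from the promo window, using the 2 hr lead and trail times and the approximate 30 min launch and 45 min teardown durations from the bullets; the function name and the sample promo times are hypothetical:

```python
from datetime import datetime, timedelta
from typing import Dict

LEAD_TIME = timedelta(hours=2)             # infra is created 2 hrs before the promo
TRAIL_TIME = timedelta(hours=2)            # teardown starts 2 hrs after the promo
LAUNCH_DURATION = timedelta(minutes=30)    # ~30 mins to launch the infra
TEARDOWN_DURATION = timedelta(minutes=45)  # ~45 mins to tear down


def infra_schedule(promo_start: datetime, promo_end: datetime) -> Dict[str, datetime]:
    """Key timestamps for creating and tearing down the infra around one promo."""
    if promo_end <= promo_start:
        raise ValueError("promo must end after it starts")
    create_at = promo_start - LEAD_TIME
    teardown_at = promo_end + TRAIL_TIME
    return {
        "create_at": create_at,
        "ready_by": create_at + LAUNCH_DURATION,
        "teardown_at": teardown_at,
        "gone_by": teardown_at + TEARDOWN_DURATION,
    }
```

For a hypothetical promo running from 10:00 to 14:00, creation starts at 08:00, the infra is ready by about 08:30, teardown starts at 16:00 and completes by about 16:45.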
46. Infrastructure Cost
• ~10K USD per promo
• Not inclusive of Data charges
• Unthinkable Savings
• Visitor experience was good
• More Bookings per Promo
Power of Elasticity is Simply priceless
AWS is “AWSome”
47. Need help architecting highly elastic solutions on AWS?
48. Leave it to the experts; we will handle this
Cloud Architecture Consulting
Cloud Application Development
Cloud Migration & Implementation
Cloud Adoption Strategy
“Let's get the job done”