The art of infrastructure elasticity

The art of designing an elastic and scalable infrastructure for a queuing application

Transcript

  • 1. The Art of Infrastructure Elasticity. April 28th, 2012, Cloud Developer Conference 2012, Bangalore. Harish Ganesan, CTO and Co-Founder, 8KMiles. Harish11g.AWS@gmail.com
  • 2. Agenda
      • Problem
      • Challenges
      • Requirements
      • Solution Architecture
      • Q&A
  • 3. What is the problem scenario?
  • 4. Big sales promotion every quarter by the enterprise
  • 5.
      • Massive number of concurrent online visitors
      • Limited processing capacity of the Booking Engine (~3k requests/sec)
  • 6.
      • Unhappy visitors
      • More booking opportunities lost
  • 7. Solution (Step 1):
      • Create a queuing app in front of the Booking Engine
      • Efficiently queue the concurrent visitors
  • 8. Solution (Step 2): Moderate and move the visitors waiting in the queuing app to the Booking Engine
  • 9. What are the challenges?
  • 10. Concurrency
      • HTTP/AJAX/REST requests
          • Total: 500+ million requests in 6 hours
          • Average: 23K+ requests/sec
          • Peak: 80K+ requests/sec
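The average on this slide follows directly from the total and the promo window. A minimal arithmetic check, using only figures quoted on slides 5 and 10 (the variable names are illustrative, not from the talk):

```python
# Figures quoted on the slides; names are illustrative.
TOTAL_REQUESTS = 500_000_000   # "500+ million requests in 6 hours"
WINDOW_HOURS = 6
PEAK_RPS = 80_000              # "Peak: 80K+ requests/sec"
BOOKING_ENGINE_RPS = 3_000     # "~3k requests/sec" from slide 5

window_seconds = WINDOW_HOURS * 3600
average_rps = TOTAL_REQUESTS / window_seconds

print(f"average: {average_rps:,.0f} requests/sec")
print(f"peak-to-average ratio: {PEAK_RPS / average_rps:.1f}x")
print(f"peak vs booking engine capacity: {PEAK_RPS / BOOKING_ENGINE_RPS:.1f}x")
```

The gap between the peak rate and the booking engine's capacity is what the queuing layer has to absorb.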
  • 11. Queue efficiency
      • Allot unique queue numbers to visitors
      • Allot queue numbers on a fair basis (as much as possible)
      • Reduce the wait time in the queue number allotment process
      • Reduce the overall queue wait time for the visitor
  • 12. Load Volatility
      • [Chart: compute over the year, showing peak utilization during promos, wasted capacity, and complete underutilization of the infrastructure at other times.]
      • Massive utilization and underutilization pattern
  • 13. IP Whitelisting
      • [Diagram: public cloud and 3rd-party services. The IP addresses of the source EC2 instances need to be whitelisted in the 3rd-party services gateway.]
      • Booking engine needs EC2 IP whitelisting for security
      • Consecutive IP range needed
  • 14. Variety of OS / Software
      • RedHat OS for the Load Balancer, NoSQL and Queue layers
      • Apache Tomcat for the Java Web/App layer
      • CentOS for processing programs
      • MySQL for result storage
      • Hadoop for analytics
  • 15. What are the requirements from the enterprise?
  • 16. Requirements
      • Elastic infrastructure
          • Create the infrastructure 2 hrs before the promo
          • Tear down the infrastructure 2 hrs after the promo
          • Elastically expand the infrastructure during the promo
      • Highly scalable and available
      • Log analytics
      • Complete infrastructure automation
  • 17. Solution Architecture
  • 18. Solution Architecture, Option 1: Single Queue (initial thought)
      • [Diagram: concurrent visitors, Queuing Application, Booking Engine.]
  • 19. Solution Architecture, Option 2: Parallel Queue (recommended)
      • [Diagram: concurrent visitors, Queuing Application, Booking Engine.]
  • 20. Request types
      • A customer visit is an HTTP request to the Queuing Application
      • The visitor's current queue position is fetched by an AJAX call to the Queuing Application every X seconds
          • More wait ~ more calls
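"More wait ~ more calls" can be made concrete with a little arithmetic. The sketch below is only an illustration: the slide does not state the value of X, so the poll interval and visitor counts used here are assumed numbers, and both function names are hypothetical.

```python
def polls_per_visitor(wait_seconds: float, poll_interval_seconds: float) -> int:
    """Number of queue-position polls one visitor issues while waiting."""
    return int(wait_seconds // poll_interval_seconds)


def aggregate_poll_rate(waiting_visitors: int, poll_interval_seconds: float) -> float:
    """Requests/sec generated by position polling alone."""
    return waiting_visitors / poll_interval_seconds


# Assumed values for illustration only: X = 10 s, 10-minute wait, 100,000 waiting visitors.
print(polls_per_visitor(600, 10))         # polls issued by one visitor
print(aggregate_poll_rate(100_000, 10))   # polling load in requests/sec
```

Under these assumptions, doubling the wait doubles the calls per visitor, which is why reducing queue wait time also reduces load on the Queuing Application.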
  • 21. Solution Step 1: The Cloud?
      • Amazon Web Services
      • We had 4+ years of architecture experience on AWS
      • It satisfied many of the customer requirements and challenges in this use case
  • 22. Solution Step 2: R53/NW
      • [Diagram: users reach Amazon Route 53, which points to EC2 instances in an Amazon Virtual Private Cloud spanning VPC Subnet 1 (Availability Zone 1) and VPC Subnet 2 (Availability Zone 2).]
      • Amazon VPC with Multi-AZ subnet configurations (HA)
      • Amazon Route 53 for managed DNS
      • DNS round-robin (RR) algorithm at Route 53
  • 23. Solution Step 3: Load Balancing
      • [Diagram: users reach Amazon Route 53, which points to HAProxy EC2 Instance 1 and HAProxy EC2 Instance 2 (M1.large, each with an Elastic IP and EBS volumes, each using a round-robin algorithm) in VPC Subnet 1, Availability Zone 1.]
  • 24. Solution Step 3: Load Balancing
      • HAProxy vs Amazon ELB
      • Custom programs to auto scale HAProxy
      • Elastic HAProxy -> attach / detach from Route 53
      • HAProxy IP whitelisting in the 3rd-party gateway
      • 16 HAProxy instances, 2 AZs, 2 subnets
      • RR load balancing algorithm
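The combination of round-robin balancing with elastic attach and detach can be sketched in a few lines. This is a plain in-memory model, not the custom programs from the talk and not the Route 53 or HAProxy API; the class name `RoundRobinPool` and the backend labels are hypothetical.

```python
class RoundRobinPool:
    """In-memory model of a round-robin pool whose members can come and go."""

    def __init__(self, backends=()):
        self._backends = list(backends)
        self._next = 0  # index of the backend to hand out next

    def attach(self, backend):
        """Add a backend (e.g. a newly launched load balancer) to the rotation."""
        if backend not in self._backends:
            self._backends.append(backend)

    def detach(self, backend):
        """Remove a backend while keeping the rotation order for the rest."""
        if backend in self._backends:
            idx = self._backends.index(backend)
            self._backends.remove(backend)
            if idx < self._next:
                self._next -= 1

    def pick(self):
        """Return the next backend in round-robin order."""
        if not self._backends:
            raise RuntimeError("no backends attached")
        self._next %= len(self._backends)
        backend = self._backends[self._next]
        self._next = (self._next + 1) % len(self._backends)
        return backend


pool = RoundRobinPool(["lb-1", "lb-2", "lb-3"])
first_round = [pool.pick() for _ in range(4)]
pool.detach("lb-2")   # scale in
pool.attach("lb-4")   # scale out
next_round = [pool.pick() for _ in range(3)]
```

In the architecture described on the slide, the same idea applies at two levels: Route 53 rotates across HAProxy instances, and each HAProxy rotates across its Web/App instances.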
  • 25. Solution Step 4: Web/App Servers
      • [Diagram: users reach Amazon Route 53, then HAProxy EC2 Instance 1 (round-robin algorithm), then Web/App EC2 Instances 1 to 3 (C1.Xlarge, with Elastic IP and EBS volumes) in VPC Subnet 1, Availability Zone 1.]
  • 26. Solution Step 4: Web/App Servers
      • 3 Web/App instances under every HAProxy
      • C1.Xlarge instance type for Web/App instances
      • Custom programs to auto scale C1.Xlarge
      • Automatic attach / detach from HAProxy
      • Every Web/App instance has an EIP for IP whitelisting
      • 48 Web/App EC2 instances spread across 2 AZs
  • 27. Solution Step 5: Queue Servers
      • [Diagram: users reach Amazon Route 53, then HAProxy EC2 Instance 1 (round-robin algorithm), then Web/App 1 to 3, then RabbitMQ (m1.large, with EBS volumes) in VPC Subnet 1, Availability Zone 1.]
  • 28. Solution Step 5: Queue Servers
      • RabbitMQ vs Amazon SQS
      • FIFO / concurrency / no duplicate messages
      • 1 RabbitMQ instance for queuing in every sector
      • M1.large instance type
      • 16 RabbitMQ instances overall
  • 29. Solution Step 6: Processors/Redis (single sector view)
      • Components of a single sector:
          1. One HAProxy
          2. Three Web/App
          3. One RabbitMQ
          4. One BG Processor Node
          5. Two Redis
      • "Sector" is not an AWS term; it is an 8KMiles term for logical EC2 instance groups for this use case
      • [Diagram: Amazon Route 53, HAProxy (round-robin algorithm), Web/App 1 to 3, RabbitMQ, Processors, Redis Master, Redis Slave, Booking Engine.]
  • 30. Solution Step 6: Redis
      • Redis vs Amazon DynamoDB
      • Redis: NoSQL key-value data store
      • Visitors are shown their current queue position every X seconds from Redis
      • 1 Redis Master-Slave pair for every sector
      • M1.large instance type for Redis
      • 32 Redis instances overall
  • 31. Solution Step 6: Processors
      • BG Processors: Java programs to
          • RabbitMQ -> Redis: allot queue numbers to visitor requests and insert them into Redis
          • Redis -> Booking Engine: moderate the movement of queued visitors from Redis to the Booking Engine
          • Process the response status / booking status / inactive visitors / timeouts
      • 2 BG Processor nodes per sector
      • CPU intensive: C1.Xlarge instance type
      • 32 BG Processor instances overall
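The two processor duties above (allotting queue numbers, then moderating release to the Booking Engine) can be illustrated with an in-memory model. This is a sketch under stated assumptions, not the Java programs from the talk: a `deque` stands in for the RabbitMQ queue, a `dict` stands in for Redis, and `SectorProcessor` and `release_per_tick` are hypothetical names.

```python
from collections import deque
from itertools import count


class SectorProcessor:
    """In-memory stand-in for one sector's background processor."""

    def __init__(self, release_per_tick: int):
        self.incoming = deque()      # stands in for the RabbitMQ queue (FIFO)
        self.positions = {}          # stands in for Redis: visitor id -> queue number
        self._numbers = count(1)     # monotonically increasing queue numbers
        self.release_per_tick = release_per_tick

    def enqueue(self, visitor_id):
        """A Web/App instance publishes a visitor request."""
        self.incoming.append(visitor_id)

    def allot_numbers(self):
        """Queue -> store: give each new visitor a unique number, in arrival order."""
        while self.incoming:
            visitor_id = self.incoming.popleft()
            if visitor_id not in self.positions:   # ignore duplicate requests
                self.positions[visitor_id] = next(self._numbers)

    def release(self):
        """Store -> Booking Engine: release at most release_per_tick visitors, lowest number first."""
        ordered = sorted(self.positions, key=self.positions.get)
        batch = ordered[: self.release_per_tick]
        for visitor_id in batch:
            del self.positions[visitor_id]
        return batch


processor = SectorProcessor(release_per_tick=2)
for visitor in ["a", "b", "a", "c"]:
    processor.enqueue(visitor)
processor.allot_numbers()
allotted = dict(processor.positions)
first_batch = processor.release()
second_batch = processor.release()
```

Capping each release at a fixed batch size is one simple way to keep the downstream rate below the Booking Engine's limit; the actual moderation policy is not described in the slides.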
  • 32. Overall Solution Architecture
      • "Sector" is not an AWS term; it is an 8KMiles term for logical EC2 instance groups for this use case
      • [Diagram: Amazon Route 53 in front of Sectors 1 to 16, each containing HAProxy, Web/App, RabbitMQ, Redis and BG Programs, all feeding the Booking Engine.]
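The per-tier totals quoted on slides 24 to 31 can be cross-checked against the 16-sector layout. The sketch below uses 2 BG Processor nodes per sector as stated on slide 31 (slide 29's component list says one); the dictionary and variable names are illustrative.

```python
SECTORS = 16

# Instances per sector, taken from slides 24-31.
PER_SECTOR = {
    "HAProxy": 1,
    "Web/App": 3,
    "RabbitMQ": 1,
    "Redis": 2,          # one master, one slave
    "BG Processor": 2,   # per slide 31
}

totals = {role: per_sector * SECTORS for role, per_sector in PER_SECTOR.items()}
grand_total = sum(totals.values())

for role, total in totals.items():
    print(f"{role}: {total}")
print(f"all sector instances: {grand_total}")
```

The per-tier totals match the slides (16, 48, 16, 32, 32). The sector instances alone come to 144, which sits below the "150+ EC2 instances" on the Infrastructure slide; the remainder is presumably monitoring and the Puppet server, though the slides do not break that down.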
  • 33. Scalability
      • [Diagram: Amazon Route 53 in front of an Amazon Virtual Private Cloud. AZ-1 (VPC Subnet 1) holds Sector 1 and Sector 2; AZ-2 (VPC Subnet 2) holds Sector 3 and Sector 4; each sector is a group of EC2 instances.]
  • 34. Scalability
      • New sectors containing the LB, Web, Queue, NoSQL and BG stack are created automatically depending on the load
      • The same AZ or multiple AZs can be specified for the creation
      • CloudWatch custom parameters were used
      • Automated Java programs were used for the sector creation
      • No manual intervention needed
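A load-driven sector count can be sketched as a simple ceiling division. The slides do not give a per-sector capacity or the CloudWatch thresholds that were used, so the 5,000 requests/sec figure below is an assumption (the 80K peak divided across 16 sectors), and `sectors_needed` is a hypothetical helper, not the Java program from the talk.

```python
import math


def sectors_needed(current_rps: float, per_sector_rps: float, min_sectors: int = 1) -> int:
    """Smallest number of sectors that can absorb current_rps, never below min_sectors."""
    return max(min_sectors, math.ceil(current_rps / per_sector_rps))


ASSUMED_PER_SECTOR_RPS = 5_000  # assumption: 80K peak / 16 sectors

print(sectors_needed(80_000, ASSUMED_PER_SECTOR_RPS))  # at the quoted peak
print(sectors_needed(23_000, ASSUMED_PER_SECTOR_RPS))  # near the quoted average
```

A real controller would also need hysteresis so that sectors are not created and destroyed on every small fluctuation.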
  • 35. High Availability @ Instance level
      • [Diagram: Amazon Route 53 in front of an Amazon Virtual Private Cloud, with groups of EC2 instances in AZ-1 (VPC Subnet 1) and AZ-2 (VPC Subnet 2).]
  • 36. High Availability @ Instance level
      • HA built at the Web/App, Redis and BG Processor instances
      • Any failed or non-responsive EC2 instances are automatically detected and replaced by Java programs
      • No manual intervention needed
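Detect-and-replace can be sketched as one reconciliation pass over the fleet. This is a minimal illustration, not the talk's Java programs: the health check and the launch call are passed in as plain functions, and every name here (`reconcile`, the instance labels) is hypothetical.

```python
def reconcile(instances, is_healthy, launch_replacement):
    """Return the fleet with every unhealthy instance swapped for a replacement."""
    return [
        inst if is_healthy(inst) else launch_replacement(inst)
        for inst in instances
    ]


# Illustrative stand-ins for a real health probe and a real instance launch.
down = {"web-2"}
fleet = ["web-1", "web-2", "web-3"]
fleet = reconcile(
    fleet,
    is_healthy=lambda inst: inst not in down,
    launch_replacement=lambda inst: inst + "-replacement",
)
print(fleet)
```

Running such a pass on a timer gives the "no manual intervention" behaviour described on the slide, provided the replacement is also re-attached to its HAProxy.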
  • 37. High Availability @ Sector level
      • [Diagram: Amazon Route 53 in front of an Amazon Virtual Private Cloud, with Sectors 1 to 6 spread as groups of EC2 instances across AZ-1 (VPC Subnet 1) and AZ-2 (VPC Subnet 2).]
  • 38. High Availability @ Sector level
      • Any failed or non-responsive instances inside sectors are automatically detected and replaced by Java programs
      • If Sector 3 fails, the other sectors remain active and can take requests
  • 39. High Availability @ AZ level
      • [Diagram: Amazon Route 53 in front of an Amazon Virtual Private Cloud, with groups of EC2 instances in AZ-1 (VPC Subnet 1) and AZ-2 (VPC Subnet 2).]
  • 40. High Availability @ AZ level
      • If the entire AZ-2 fails, the load is balanced to instances in AZ-1
      • Automated programs create new sectors inside AZ-1 to handle the load
  • 41. Log Analytics
      • [Diagram: (1) EC2 instances, (2) S3 bucket with logs, (3) RDS MySQL, with Elastic MapReduce jobs on an HDFS cluster.]
      • Redis, Web/App, HAProxy and RabbitMQ logs synced to S3
      • Elastic MapReduce jobs to process / analyze the logs
      • Processed results moved to RDS MySQL for reports / visualizations
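The map and reduce steps of such a log job can be shown on a handful of lines in plain Python. The log format (`timestamp path status`), the sample lines and the function names are all made up for illustration; the slides do not describe the real log layout or the actual Elastic MapReduce jobs.

```python
from collections import Counter


def map_line(line):
    """Map step: turn one log line into a (path, status) key, or None if malformed."""
    parts = line.split()
    if len(parts) != 3:
        return None
    _timestamp, path, status = parts
    return (path, status)


def reduce_counts(keys):
    """Reduce step: count occurrences of each key, skipping malformed lines."""
    return Counter(key for key in keys if key is not None)


# Made-up sample data in an assumed format.
sample_log = [
    "t1 /queue/position 200",
    "t1 /queue/join 200",
    "t2 /queue/position 200",
    "t2 /queue/position 503",
    "garbled line",
]
report = reduce_counts(map_line(line) for line in sample_log)
for (path, status), hits in sorted(report.items()):
    print(path, status, hits)
```

The resulting counts are the kind of small, aggregated table that fits naturally into MySQL for reporting.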
  • 42. Monitoring
      • Nagios + Puppet (combined) for auto-scaled monitoring infrastructure and deployment
      • CloudWatch custom metrics / Tomcat Valve / automated Java programs for EC2
  • 43. Backup
      • No backups -> only syncs to S3
      • Golden AMI snapshots to S3
      • Periodic sync of data between EC2 and S3
      • Periodic log sync from Web/App to S3
  • 44. Infrastructure
      • Amazon Route 53
      • Amazon VPC: public and private subnets
      • 150+ EC2 instances, 2 AZs, 1 region
      • 70+ Elastic IPs
      • 200+ EBS volumes
      • S3 buckets
      • Suite of monitoring tools
      • 1 Puppet server
      • Amazon CloudWatch
      • Amazon CloudFront
  • 45. Infrastructure Elasticity
      • Entire infrastructure created 2 hrs before the promo
      • Infrastructure torn down 2 hrs after the promo
      • ~30 mins to launch the infrastructure in AWS
      • ~45 mins to tear it down
      • Automated failure detection / rectification
      • Automated programs for infrastructure creation
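The timings on this slide translate into a simple schedule around each promo. A minimal sketch, assuming a promo with known start and end times; the function name `infra_schedule` and the example date are hypothetical, while the 2-hour margins and the ~30 and ~45 minute durations come from the slide.

```python
from datetime import datetime, timedelta


def infra_schedule(promo_start: datetime, promo_end: datetime) -> dict:
    """Compute when to create and tear down the infrastructure around one promo."""
    create_at = promo_start - timedelta(hours=2)     # created 2 hrs before the promo
    teardown_at = promo_end + timedelta(hours=2)     # torn down 2 hrs after the promo
    return {
        "create_at": create_at,
        "ready_by": create_at + timedelta(minutes=30),    # ~30 mins to launch
        "teardown_at": teardown_at,
        "gone_by": teardown_at + timedelta(minutes=45),   # ~45 mins to tear down
    }


# Example: a 6-hour promo on an arbitrary, made-up date.
plan = infra_schedule(datetime(2012, 1, 1, 10, 0), datetime(2012, 1, 1, 16, 0))
for step, when in plan.items():
    print(step, when.isoformat())
```

With these figures the infrastructure exists for under 11 hours per 6-hour promo, which is the basis of the cost argument that follows.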
  • 46. Infrastructure Cost
      • ~10K USD per promo
      • Not inclusive of data charges
      • Unthinkable savings
      • Visitor experience was good
      • More bookings per promo
      • Power of elasticity is simply priceless
      • AWS is “AWSome”
  • 47. Need help architecting highly elastic solutions on AWS?
  • 48. Leave it to the experts, we will handle this
      • Cloud Architecture Consulting
      • Cloud Application Development
      • Cloud Migration & Implementation
      • Cloud Adoption Strategy
      • “Let's get the job done”
  • 49. Q&A
      • Harish11g.aws@gmail.com
      • http://in.linkedin.com/in/harishganesan
      • www.twitter.com/harish11g
      • http://harish11g.blogspot.com
      • Amazon Web Services: aws.amazon.com
      • aws.amazon.com/contact-us/aws-sales