Surviving an Amazon Outage

Surviving an Amazon Outage Presentation Transcript

  • 1. Surviving an Amazon Outage. Neil Armitage, Cluster Implementation Engineer, Continuent. Wednesday, 24 April 13. © Continuent 2012.
  • 2. Overview
      • Continuent’s external and internal infrastructure is built in AWS
      • Review carried out in the summer of 2012 after several AWS outages
      • Treated the review as a customer engagement
      • Further review in the autumn of 2012, leading to the multi-cloud deployment
  • 3. What is AWS?
      Amazon Web Services is a collection of remote computing services (also called web services) that together make up a cloud computing platform. The central services are EC2 (compute) and S3 (storage).
  • 4. AWS Regions
      • Ireland (3 AZ)
      • Sao Paulo (2 AZ)
      • Northern Virginia (5 AZ)
      • Oregon (3 AZ)
      • California (3 AZ)
      • Singapore (2 AZ)
      • Tokyo (3 AZ)
      • Sydney (2 AZ)
  • 5. AWS Availability Zones
      [Diagram: one Region containing three Availability Zones and a second Region containing two.]
  • 6. AWS Services
      • Compute - EC2
      • Network - Route 53 and Virtual Private Cloud (VPC)
      • Content Delivery - CloudFront
      • Storage - S3, Glacier, EBS
      • Database - DynamoDB, RDS, Redshift, SimpleDB
      • Deployment - CloudFormation, Beanstalk, OpsWorks
  • 7. AWS Size*
      • Between 100K and 500K physical servers
      • 1.5 million public IP addresses
      • S3 holds more than 2 trillion objects and serves 1.1 million requests per second
      • 1/3 of daily users access a site running on AWS
      • 1% of internet traffic goes through Amazon infrastructure
      * Estimates based on various internet sources
  • 8. Continuent Systems
      • External-facing website
      • Jira/Confluence internal systems
      • Subversion
      • Jenkins build system
  • 9. External Website
      [Diagram labels: Internet, Elastic IP, Web Server, DB Server, inside a single Availability Zone of one Region.]
  • 10. Jira/Confluence/Subversion
      [Diagram labels: Internet, Elastic IP, App Server (Jira, Confluence), SVN Server, MySQL, inside a single Availability Zone of one Region.]
  • 11. AWS Problems - Summer 2012
      “Amazon Cloud Hit by Real Clouds, Downing Netflix, Instagram, Other Sites”
      Severe storms caused power outages at AWS US-East data centers; generators failed, taking out 7% of EC2 instances.
      http://www.pcworld.com/article/258627/amazon_cloud_hit_by_real_clouds_knocking_out_popular_sites_like_netflix_instagram.html
  • 12. Migration Plan
      • Move to a clustered Continuent Tungsten environment
      • Ensure all components are replicated into at least one other AWS Region
      • Limited downtime on customer-facing systems
      • Minimal downtime on internal systems
  • 13. [Diagram: Tungsten data service "nyc" with one Master and two Slaves, three Replicators, three Managers, and two App Logic / Tungsten Connector pairs.]
  • 14. [Diagram: Tungsten data service "nyc" with one Master and two Slaves, three Replicators, three Managers, and two App Logic / Tungsten Connector pairs, annotated "Monitoring and control".]
  • 17. Website Database Tier - Round 1
      [Diagram: Region US-EAST-1 with Availability Zones 1B and 1C, Region US-WEST-1 with Availability Zone 1C, S3 backups in each Region, and Connectors.]
  • 18. DB Failures - Failure in US-EAST-1C
      [Diagram: Region US-EAST-1 with Availability Zones 1B and 1C, Region US-WEST-1 with Availability Zone 1C, S3 backups in each Region, and Connectors.]
  • 19. DB Failures - Failure in US-EAST
      [Diagram: Region US-EAST-1 with Availability Zones 1B and 1C, Region US-WEST-1 with Availability Zone 1C, S3 backups in each Region, and Connectors.]
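The two database failure cases above (loss of one Availability Zone, loss of the whole US-EAST region) both come down to promoting a surviving replica. The slides do not say how the new master is chosen, so the following is only a minimal sketch under stated assumptions: the `Replica` type, the `applied_seqno` field, and the rule "most up-to-date survivor wins, ties go to the failed master's own region" are all hypothetical and are not taken from Tungsten or from the talk.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Replica:
    """Hypothetical view of one database node; not a Tungsten data structure."""
    name: str
    region: str
    zone: str
    online: bool
    applied_seqno: int  # assumed: last transaction sequence number applied


def pick_new_master(failed_master_region: str,
                    replicas: List[Replica]) -> Optional[Replica]:
    """Return the replica to promote, or None if no replica survived.

    Assumed rule: prefer the most up-to-date online replica; on a tie,
    prefer one in the same region as the failed master.
    """
    candidates = [r for r in replicas if r.online]
    if not candidates:
        return None
    return max(
        candidates,
        key=lambda r: (r.applied_seqno, r.region == failed_master_region),
    )
```

With one surviving replica in US-EAST-1B and one in US-WEST-1C at the same sequence number, this picks the US-EAST-1B node; if all of US-EAST is offline, it falls through to the US-WEST-1C node.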
  • 20. DEMO
  • 21. Website Web Tier - Round 1
      [Diagram: Region US-EAST-1 with Availability Zones 1B and 1C, Region US-WEST-1 with Availability Zone 1C, S3 backups in each Region, and Internet traffic arriving through an Elastic IP (EIP).]
  • 22. Web Failures - Failure in US-EAST-1C
      [Diagram: Region US-EAST-1 with Availability Zones 1B and 1C, Region US-WEST-1 with Availability Zone 1C, S3 backups in each Region, and Internet traffic arriving through an Elastic IP (EIP).]
  • 23. Web Failures - Failure in US-EAST
      [Diagram: Region US-EAST-1 with Availability Zones 1B and 1C, Region US-WEST-1 with Availability Zone 1C, S3 backups in each Region, Internet traffic arriving through an Elastic IP (EIP), and a DNS update.]
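The two web failure slides suggest two different recovery actions: an Availability Zone failure is handled with the Elastic IP, while a full US-EAST failure needs a DNS update, since an Elastic IP cannot be moved to an instance in another region. A minimal sketch of that decision follows; the `WebServer` type, the `Action` names, and the assumption that any healthy server in the same region can take over the Elastic IP are illustrative and do not come from the slides.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class Action(Enum):
    NONE = "no action"
    MOVE_EIP = "re-associate the Elastic IP within the region"
    UPDATE_DNS = "point DNS at the other region"


@dataclass(frozen=True)
class WebServer:
    """Hypothetical description of one web server and its health."""
    region: str
    zone: str
    healthy: bool


def choose_failover(active: WebServer,
                    servers: List[WebServer]) -> Tuple[Action, Optional[WebServer]]:
    """Pick a recovery action and target for the currently active server."""
    if active.healthy:
        return Action.NONE, active
    # Availability Zone failure: a healthy server in the same region can
    # take over the Elastic IP, so clients keep using the same address.
    for server in servers:
        if server.healthy and server.region == active.region:
            return Action.MOVE_EIP, server
    # Region failure: the Elastic IP cannot follow, so DNS has to change.
    for server in servers:
        if server.healthy:
            return Action.UPDATE_DNS, server
    return Action.NONE, None
```

The function only decides what to do. Carrying out the Elastic IP re-association or the DNS change is left out, because the talk does not show how Continuent automated either step.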
  • 24. Jira/Confluence/SVN - Round 1
      [Diagram: Region US-EAST-1 with Availability Zone 1C, Region US-WEST-1 with Availability Zone 1C, S3 backups in each Region, and Internet traffic arriving through an Elastic IP (EIP).]
  • 25. AWS Failures - Autumn 2012
      “Amazon Web Services outage takes out popular websites again”
      • EBS degraded performance
      • Problems allocating new volumes
      http://www.pcworld.com/article/2012852/amazon-web-services-outage-takes-out-popular-websites-again.html
  • 26. Website Database Tier - Round 2
      [Diagram: Region US-EAST-1 with Availability Zones 1B and 1C, Region US-WEST-1 with Availability Zone 1C, S3 backups in each Region, plus Rackspace.]
  • 27. Website Web Tier - Round 2
      [Diagram: Region US-EAST-1 with Availability Zones 1B and 1C, Region US-WEST-1 with Availability Zone 1C, S3 backups in each Region, Internet traffic arriving through an Elastic IP (EIP), plus Rackspace.]
  • 28. Jira/Confluence/SVN - Round 2
      [Diagram: Region US-EAST-1 with Availability Zone 1C, Region US-WEST-1 with Availability Zone 1C, S3 backups in each Region, Internet traffic arriving through an Elastic IP (EIP), plus Rackspace.]
  • 29. Best Practices
      • RAID EBS volumes (RAID1)
      • Backups
      • xtrabackup (backed up into S3)
      • EBS snapshot:
        ec2-consistent-snapshot --mysql --freeze-filesystem /vol --region eu-west-1 --description "$(hostname) RAID snapshot $(date +'%Y-%m-%d %H:%M:%S')" vol-1f9a6446 vol-649a643d
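When the snapshot is driven from a script rather than typed by hand, building the argument list in code avoids shell quoting mistakes in the description string. The sketch below only assembles the command shown on the slide, using the same flags; the function name `build_snapshot_command` and its parameters are hypothetical, and nothing is executed.

```python
import shlex
import socket
from datetime import datetime
from typing import List, Optional


def build_snapshot_command(mount_point: str,
                           region: str,
                           volume_ids: List[str],
                           now: Optional[datetime] = None,
                           host: Optional[str] = None) -> List[str]:
    """Assemble the ec2-consistent-snapshot argument list from the slide.

    `now` and `host` can be passed in to make the result reproducible;
    by default the current time and local hostname are used.
    """
    now = now or datetime.now()
    host = host or socket.gethostname()
    description = f"{host} RAID snapshot {now.strftime('%Y-%m-%d %H:%M:%S')}"
    return [
        "ec2-consistent-snapshot",
        "--mysql",
        "--freeze-filesystem", mount_point,
        "--region", region,
        "--description", description,
        *volume_ids,
    ]


if __name__ == "__main__":
    cmd = build_snapshot_command("/vol", "eu-west-1",
                                 ["vol-1f9a6446", "vol-649a643d"])
    # Print a copy-pasteable shell line instead of running the command.
    print(shlex.join(cmd))
```

Passing both volumes of the RAID set in a single invocation matches the slide, where the two volume IDs appear together after the description.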
  • 30. Best Practices
      • Monitoring
      • Nagios scripts converted to email alerts
      • New Relic
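The slide does not show how the Nagios scripts were converted to email alerts. One plausible shape, offered only as a sketch, is a small wrapper that maps the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) to an email message. The function name `build_alert`, the subject format, and the choice to stay silent on OK are assumptions.

```python
from email.message import EmailMessage
from typing import Optional

# Standard Nagios plugin exit codes.
NAGIOS_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def build_alert(check_name: str, host: str, exit_code: int, output: str,
                sender: str, recipient: str) -> Optional[EmailMessage]:
    """Turn one check result into an email, or None when the check is OK."""
    # Any exit code outside 0-3 is treated as UNKNOWN.
    state = NAGIOS_STATES.get(exit_code, "UNKNOWN")
    if state == "OK":
        return None
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = f"[{state}] {check_name} on {host}"
    message.set_content(output)
    return message
```

The returned message could then be handed to `smtplib.SMTP.send_message`; delivery is left out because the mail setup is not described in the talk.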
  • 31. Lessons Learnt
      • EC2 instances fail
      • One of anything is never enough
      • Don’t assume you can spin up more resources instantly
      • Think multi-cloud, public/private
      • Resources are disposable - throw away and rebuild if needed
  • 32. Further Plans
      • Real-time replication of web assets (GlusterFS?)
      • Introduce an Elastic Load Balancer in front of the US-EAST web servers to allow automatic web failover
      • Migrate into a VPC
      • Investigate Route 53 for DNS failover
  • 33. We are recruiting. Come to our booth for more information.
  • 34. Continuent Website: http://www.continuent.com
      Tungsten Replicator 2.0: http://code.google.com/p/tungsten-replicator
      Our Blogs: http://scale-out-blog.blogspot.com, http://datacharmer.blogspot.com, http://flyingclusters.blogspot.com
      560 S. Winchester Blvd., Suite 500, San Jose, CA 95128
      Tel +1 (866) 998-3642, Fax +1 (408) 668-1009
      e-mail: sales@continuent.com