Surviving an Amazon Outage
Published in: Technology, Design
Transcript

  • 1. Surviving an Amazon Outage
      Neil Armitage, Cluster Implementation Engineer, Continuent
      ©Continuent 2012. Wednesday, 24 April 13
  • 2. Overview
      • Continuent's external/internal infrastructure is built in AWS
      • Review carried out in the summer of 2012 after several AWS outages
      • Treated the review as a customer engagement
      • Further review in autumn 2012, leading to the multi-cloud deployment
  • 3. What is AWS
      Amazon Web Services is a collection of remote computing services (also called web services) that together make up a cloud computing platform.
      The central services are EC2 (Compute) and S3 (Storage).
  • 4. AWS Regions
      • Ireland (3 AZ)
      • Sao Paulo (2 AZ)
      • Northern Virginia (5 AZ)
      • Oregon (3 AZ)
      • California (3 AZ)
      • Singapore (2 AZ)
      • Tokyo (3 AZ)
      • Sydney (2 AZ)
  • 5. AWS Availability Zones
      [Diagram: two Regions, one containing three Availability Zones and the other containing two]
  • 6. AWS Services
      • Compute - EC2
      • Network - Route 53 and Virtual Private Cloud (VPC)
      • Content Delivery - CloudFront
      • Storage - S3, Glacier, EBS
      • Database - DynamoDB, RDS, Redshift, SimpleDB
      • Deployment - CloudFormation, Beanstalk, OpsWorks
  • 7. AWS Size*
      • Between 100K and 500K physical servers
      • 1.5 million public IP addresses
      • S3 holds more than 2 trillion objects and serves 1.1 million requests per second
      • 1/3 of daily internet users access a site running on AWS
      • 1% of internet traffic goes through Amazon infrastructure
      * Estimates based on various internet sources
  • 8. Continuent Systems
      • External facing website
      • Jira/Confluence internal systems
      • Subversion
      • Jenkins build system
  • 9. External Website
      [Diagram labels: Internet, Elastic IP, Web Server, DB Server; Region, Availability Zone]
  • 10. Jira/Confluence/Subversion
      [Diagram labels: Internet, Elastic IP, App Server, Jira, Confluence, SVN Server, MySQL; Availability Zone, Region]
  • 11. AWS Problems Summer 2012
      “Amazon Cloud Hit by Real Clouds, Downing Netflix, Instagram, Other Sites”
      Severe storms caused power outages at AWS US-East data centers; generators failed, taking out 7% of EC2 instances.
      http://www.pcworld.com/article/258627/amazon_cloud_hit_by_real_clouds_knocking_out_popular_sites_like_netflix_instagram.html
  • 12. Migration Plan
      • Move to a clustered Continuent Tungsten environment
      • Ensure all components are replicated into at least one other AWS Region
      • Limited downtime on customer-facing systems
      • Minimal downtime on internal systems
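
The second point of the Migration Plan (every component replicated into at least one other AWS Region) lends itself to an automated check. The following is a minimal sketch, not Continuent's tooling: the inventory format and the component names in `DEPLOYMENTS` are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical inventory of (component, region) pairs.
# These names are assumptions for the example, not taken from the deck.
DEPLOYMENTS = [
    ("website-db", "us-east-1"),
    ("website-db", "us-west-1"),
    ("website-web", "us-east-1"),
    ("website-web", "us-west-1"),
    ("jira", "us-east-1"),
]


def single_region_components(deployments):
    """Return the sorted names of components that run in only one region."""
    regions = defaultdict(set)
    for component, region in deployments:
        regions[component].add(region)
    return sorted(name for name, seen in regions.items() if len(seen) < 2)


if __name__ == "__main__":
    # Anything listed here still violates the "at least one other Region" rule.
    print(single_region_components(DEPLOYMENTS))
```

Run against a real inventory, an empty result would mean the replication goal is met for every listed component.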
  • 13. [Diagram: Data Service "nyc". Two application hosts, each with App Logic and a Tungsten Connector; one Master and two Slaves, each with a Replicator and a Manager]
  • 14. [Same diagram, with "Monitoring and control" labels added]
  • 17. Website Database Tier - Round 1
      [Diagram: Region US-EAST-1 with Availability Zones 1B and 1C; Region US-WEST-1 with Availability Zone 1C; S3 Backups (shown twice); Connectors]
  • 18. DB Failures - Failure in US-EAST-1C
      [Same diagram as the Website Database Tier - Round 1 slide]
  • 19. DB Failures - Failure in US-EAST
      [Same diagram as the Website Database Tier - Round 1 slide]
  • 20. DEMO
  • 21. Website Web Tier - Round 1
      [Diagram: Region US-EAST-1 with Availability Zones 1B and 1C; Region US-WEST-1 with Availability Zone 1C; S3 Backups (shown twice); Internet; EIP]
  • 22. Web Failures - Failure in US-EAST-1C
      [Same diagram as the Website Web Tier - Round 1 slide]
  • 23. Web Failures - Failure in US-EAST
      [Same diagram as the Website Web Tier - Round 1 slide, with a "DNS Update" label added]
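
The failure slides distinguish between losing one Availability Zone (US-EAST-1C) and losing the whole US-EAST region, with a DNS update shown only in the regional case. A rough sketch of that decision follows, assuming a simple preference order (another zone in the same region first, another region second). The `Node` type and the functions `pick_failover_target` and `needs_dns_update` are hypothetical names for illustration and do not describe how Tungsten itself chooses a target.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    """Hypothetical description of one server (illustrative only)."""
    name: str
    region: str
    zone: str
    healthy: bool


def pick_failover_target(failed: Node, nodes: List[Node]) -> Optional[Node]:
    """Prefer a healthy node in another zone of the same region,
    then fall back to a healthy node in a different region."""
    candidates = [n for n in nodes if n.healthy and n.name != failed.name]
    same_region = [
        n for n in candidates
        if n.region == failed.region and n.zone != failed.zone
    ]
    if same_region:
        return same_region[0]
    other_region = [n for n in candidates if n.region != failed.region]
    return other_region[0] if other_region else None


def needs_dns_update(failed: Node, target: Optional[Node]) -> bool:
    """An Elastic IP is scoped to one region, so it can follow a failover
    inside that region; crossing regions needs a DNS change instead."""
    return target is not None and target.region != failed.region


if __name__ == "__main__":
    failed = Node("web-east-1c", "us-east-1", "1c", healthy=False)
    fleet = [
        failed,
        Node("web-east-1b", "us-east-1", "1b", healthy=True),
        Node("web-west-1c", "us-west-1", "1c", healthy=True),
    ]
    target = pick_failover_target(failed, fleet)
    print(target, needs_dns_update(failed, target))
```

With the sample fleet, a zone failure stays inside US-EAST-1 and keeps the Elastic IP; marking the 1B node unhealthy as well moves traffic to US-WEST-1 and flags the DNS update.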
  • 24. Jira/Confluence/SVN - Round 1
      [Diagram: Region US-EAST-1 with Availability Zone 1C; Region US-WEST-1 with Availability Zone 1C; S3 Backups (shown twice); Internet; EIP]
  • 25. AWS Failures - Autumn 2012
      “Amazon Web Services outage takes out popular websites again”
      • EBS degraded performance
      • Problems allocating new volumes
      http://www.pcworld.com/article/2012852/amazon-web-services-outage-takes-out-popular-websites-again.html
  • 26. Website Database Tier - Round 2
      [Diagram: Region US-EAST-1 with Availability Zones 1B and 1C; Region US-WEST-1 with Availability Zone 1C; S3 Backups (shown twice); RackSpace]
  • 27. Website Web Tier - Round 2
      [Diagram: Region US-EAST-1 with Availability Zones 1B and 1C; Region US-WEST-1 with Availability Zone 1C; S3 Backups (shown twice); Internet; EIP; RackSpace]
  • 28. Jira/Confluence/SVN - Round 2
      [Diagram: Region US-EAST-1 with Availability Zone 1C; Region US-WEST-1 with Availability Zone 1C; S3 Backups (shown twice); Internet; EIP; RackSpace]
  • 29. Best Practices
      • RAID EBS Volumes (RAID1)
      • Backups
          • xtrabackup (backed up into S3)
          • EBS Snapshot

      ec2-consistent-snapshot --mysql --freeze-filesystem /vol --region eu-west-1 --description "$(hostname) RAID snapshot $(date '+%Y-%m-%d %H:%M:%S')" vol-1f9a6446 vol-649a643d
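
For scheduled backups it can help to assemble the snapshot command in code so the description is always well formed. This is a sketch only: it reuses the flags, region, mount point, and volume IDs shown on the slide, while the function name `build_snapshot_command` and its parameters are assumptions made for the example. It builds the argument list and prints it; it does not run the tool.

```python
import shlex
import socket
from datetime import datetime


def build_snapshot_command(volume_ids, region, mount_point, hostname, now):
    """Build the ec2-consistent-snapshot argument list used on the slide.

    The flags mirror the slide; this helper itself is hypothetical.
    """
    description = f"{hostname} RAID snapshot {now:%Y-%m-%d %H:%M:%S}"
    return [
        "ec2-consistent-snapshot",
        "--mysql",
        "--freeze-filesystem", mount_point,
        "--region", region,
        "--description", description,
        *volume_ids,
    ]


if __name__ == "__main__":
    cmd = build_snapshot_command(
        ["vol-1f9a6446", "vol-649a643d"],  # both members of the RAID1 set
        "eu-west-1",
        "/vol",
        socket.gethostname(),
        datetime.now(),
    )
    # Print a shell-quoted form for review instead of executing it.
    print(shlex.join(cmd))
```

Passing the arguments as a list avoids the quoting problems that arise when the date format contains a space.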
  • 30. Best Practices
      • Monitoring
          • Nagios scripts converted to email alerts
          • New Relic
  • 31. Lessons Learnt
      • EC2 instances fail
      • One of anything is never enough
      • Don't assume you can spin up more resources instantly
      • Think multi-cloud, public/private
      • Resources are disposable: throw away and rebuild if needed
  • 32. Further Plans
      • Realtime replication of web assets (GlusterFS?)
      • Introduce an Elastic Load Balancer in front of the US-EAST web servers to allow automatic web failover
      • Migrate into a VPC
      • Investigate Route 53 for DNS failover
  • 33. We are Recruiting
      Come to our booth for more information
  • 34. Continuent Website: http://www.continuent.com
      Tungsten Replicator 2.0: http://code.google.com/p/tungsten-replicator
      Our Blogs:
      http://scale-out-blog.blogspot.com
      http://datacharmer.blogspot.com
      http://flyingclusters.blogspot.com
      560 S. Winchester Blvd., Suite 500, San Jose, CA 95128
      Tel +1 (866) 998-3642
      Fax +1 (408) 668-1009
      e-mail: sales@continuent.com
