Architecting for failures in the Cloud - Barcamp Bangalore 2013

This is a talk from Barcamp Bangalore 2013 on how to architect for dealing with failures in the Cloud.

P3 InfoTech Solutions Pvt. Ltd. helps organizations achieve business breakthroughs by adopting Cloud Computing through our Outsourced Product Development and Cloud Consulting service offerings. Check out our service offerings at http://www.p3infotech.in.

Published in: Technology

  1. HOW YOU COULD HAVE SURVIVED THE AWS CHRISTMAS EVE OUTAGE
     ARCHITECTING FOR FAILURES IN THE CLOUD
     Pavan Verma
     Founder, P3 InfoTech Solutions Pvt. Ltd.
     pavan@p3infotech.in, @YingYangPavan
  2. How are datacenter failures relevant to me?
  3. Types of Failures in a Datacenter
     • Electronic components – CPUs, Memory
     • Mechanical components – Hard Disks, Fans
     • Electrical components – Power Supplies, Air conditioners
     • Networking equipment – Network cables, Routers, Switches
     • Software bugs
     • Power disruption
     • Human errors
  4. Cost of Failures
     • Tangible cost
       • Lost business = Business volume (revenue per unit time) × Duration of failure
       • Cost of lost data or time to re-create lost data
     • Intangible cost
       • Reputation
       • Frustration
       • Lost business opportunity
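
A rough sketch of the lost-business estimate from the Cost of Failures slide. It reads the estimate as revenue rate multiplied by outage duration, since lost revenue grows with the length of the outage. The function name and the figures are hypothetical, and the sketch assumes revenue accrues evenly over time, which real traffic rarely does.

```python
def lost_business(revenue_per_hour: float, outage_hours: float) -> float:
    """Estimate revenue lost during an outage.

    Assumption: revenue accrues evenly, so the loss is rate x duration.
    """
    if revenue_per_hour < 0 or outage_hours < 0:
        raise ValueError("inputs must be non-negative")
    return revenue_per_hour * outage_hours


# Hypothetical figures: 2,000 per hour in revenue and a 6-hour outage.
estimate = lost_business(2000.0, 6.0)
print(estimate)
```

Intangible costs such as reputation are not captured by an estimate like this.
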
  5. Techniques to deal with failures
  6. Backup
     • Backup = Copy of the data from a time before the failure
     • Can restore data after the failure to a state from before the failure
     • Limits the extent of data loss during a failure
     • Types of Backup
       • Disk-to-tape – Offline, Slower restore, Cheaper
       • Disk-to-disk – Online, Faster restore, Costlier
  7. Backup in AWS
     • Snapshots for EBS volumes
     • Database backups with RDS
     • Redundant copies of S3 objects (*)
  8. High Availability (HA)
     • Ability of the application to service requests in spite of failure of some components
     • Most prevalent notion of High Availability
       • Ability of the application to tolerate single component failures
       • Application has no single point of failure
     • How is high availability achieved?
       • Redundant components
       • Switch over traffic from the failed component to a working component
  9. High Availability (2)
     • Two types of redundant components
       • Active-Active
       • Active-Passive
     • Examples
       • Power Supplies
       • Servers
       • Databases
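
The switchover idea behind redundant components can be sketched as follows, for the Active-Passive case. This is a minimal illustration and not how ELB or any AWS service works internally. The `Component` class and `serve_with_failover` function are hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Component:
    """A redundant component (hypothetical stand-in for a server or database)."""
    name: str
    healthy: bool = True

    def handle(self, request: str) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


def serve_with_failover(request: str, components: List[Component]) -> str:
    """Try the active component first, then each standby in order."""
    for component in components:
        try:
            return component.handle(request)
        except ConnectionError:
            continue  # switch over to the next redundant component
    raise RuntimeError("all redundant components failed")


primary = Component("primary")
standby = Component("standby")
print(serve_with_failover("GET /", [primary, standby]))  # served by primary
primary.healthy = False  # simulate a single component failure
print(serve_with_failover("GET /", [primary, standby]))  # served by standby
```

In an Active-Active setup both components would take traffic all the time, and a failure would only reduce capacity instead of triggering a switchover.
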
  10. High Availability in AWS
      • Availability Zones (AZ)
      • Elastic Load Balancer
      • Database replicas
  11. Reference Architecture for High Availability setup in AWS
      [Diagram: User; ELB; AZ #1 and AZ #2, each with an Auto Scaling Group of EC2 Instances; S3; RDS Master; RDS Slave]
  12. Disaster Recovery
      • Ability to resume operations after a disaster
      • How bad can a disaster be?
        • Entire datacenter may be destroyed or become inoperable
        • Examples: 9/11, Hurricane Sandy, Northeast blackout of 2003
      • Affects all Availability Zones in a Region
  13. Disaster Recovery Solutions
      • Disaster recovery solutions involve a combination of
        • Replication / Backup of data to a different geography [On-going]
        • Start operations from the DR site [when disaster occurs]
        • Switch over traffic to the DR site
        • Sync data and restore operations to the primary site once it becomes operational again
      • Since DR involves a different geography, data replication/backup happens over WAN
  14. Recovery Point and Recovery Time
      • Recovery Point (RP) = Duration of time for which data is lost
      • Recovery Time (RT) = Duration of time in which the application is restored
      • Low numbers are better for both RP and RT
      • Often framed as RPO and RTO as part of business continuity planning
  15. Recovery Point and Recovery Time
      • Backup = High RP and High RT
      • High Availability = Zero RP and Zero RT
      • Disaster Recovery = RP and RT between Backup and HA
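
To make the Recovery Point and Recovery Time definitions concrete, here is a small sketch for the backup case. It treats RP as the gap between the last usable backup and the failure, and RT as the gap between the failure and the moment service is restored. The function name, the daily backup schedule, and the timestamps are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def recovery_point_and_time(
    backups: List[datetime], failure: datetime, restored: datetime
) -> Tuple[timedelta, timedelta]:
    """Return (RP, RT) for a failure recovered from the latest earlier backup."""
    usable = [b for b in backups if b <= failure]
    if not usable:
        raise ValueError("no backup exists from before the failure")
    last_good = max(usable)
    recovery_point = failure - last_good  # window of data that is lost
    recovery_time = restored - failure    # how long the application was down
    return recovery_point, recovery_time


# Hypothetical scenario: daily backups at midnight, failure at 15:00,
# application restored at 19:00 the same day.
backups = [datetime(2013, 1, 1), datetime(2013, 1, 2)]
rp, rt = recovery_point_and_time(
    backups,
    failure=datetime(2013, 1, 2, 15, 0),
    restored=datetime(2013, 1, 2, 19, 0),
)
print(rp, rt)
```

More frequent backups shrink RP, while a faster restore path (for example disk-to-disk instead of disk-to-tape) shrinks RT.
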
  16. Conclusion
      • Topic of IT failures is very relevant for business operations
      • Key issues with failures
        • Unavailability of application
        • Loss of data
      • Techniques to handle failures – Backups, High Availability, Disaster Recovery
      • AWS provides mechanisms to deal with failures
      • Recovery Point and Recovery Time
  17. Have a great Barcamp Bangalore 2013!
