Advanced Topics - Session 4 - Architecting for High Availability

AWS provides a platform that is ideally suited for building highly available systems, enabling you to build reliable, affordable, fault-tolerant systems that operate with a minimal amount of human interaction. This presentation covers many of the high-availability and fault-tolerance concepts and features of the various services that you can use to build highly reliable and highly available applications in the AWS Cloud: architectures involving multiple Availability Zones, including EC2 best practices and RDS Multi-AZ deployments; loosely coupled and self-healing systems involving SQS and Auto Scaling; networking best practices for high availability, including Elastic IP addresses, load balancing, and DNS; and leveraging services that are inherently built with high availability and fault tolerance in mind, including S3, Elastic Beanstalk, and more.

Ianni Vamvadelis, Manager, Solution Architecture, AWS
Daniel Richardson, Director of Engineering, JustEat

Published in: Technology, Business

  1. 1. Architecting for high availability
        Ianni Vamvadelis, Solution Architect
  2. 2. What is High Availability (HA)?
        • Percentage of time an application operates
        • Loss of availability is known as an outage or downtime
          – Planned and unplanned
          – App is offline, unreachable, or partially available
          – App is unresponsive
  3. 3. HA is related to …
        • Scalability – Often slow is indistinguishable from unavailable.
        • Fault Tolerance – Apps continue functioning when components fail
        • Disaster Recovery – Restoring service after a catastrophic event
  4. 4. HA and DR
        High Availability / Disaster Recovery
        • A continuum
        • Business continuity plan
        • Not an all-or-nothing proposition
        In the face of internal or external events, how do you…
        – Keep your applications running 24x7
        – Make sure your data is safe
        – Get an application recovered after a major disaster
  5. 5. How does AWS Help High Availability?
  6. 6. US-WEST (Oregon), EU-WEST (Ireland), AWS GovCloud (US), ASIA PAC (Tokyo), US-EAST (Virginia), ASIA PAC (Sydney), US-WEST (N. California), ASIA PAC (Singapore), SOUTH AMERICA (Sao Paulo)
  8. 8. Automation
  9. 9. AWS SERVICES
        Inherently Highly Available and Fault Tolerant Services: Amazon S3, Amazon DynamoDB, Amazon CloudFront, Amazon Route 53, Elastic Load Balancing, Amazon SQS, Amazon SNS, Amazon SES, Amazon SWF, …
        Highly Available with the right architecture: Amazon EC2, Amazon EBS, Amazon RDS, Amazon VPC
  10. 10. AWS Principles for HA
  11. 11. 1. DESIGN FOR FAILURE
          2. MULTIPLE AVAILABILITY ZONES
          3. SCALING
          4. SELF-HEALING
          5. LOOSE COUPLING
  12. 12. LET’S BUILD A HIGHLY AVAILABLE SYSTEM
  13. 13. #1 DESIGN FOR FAILURE ●○○○○
  14. 14. « Everything fails all the time » Werner Vogels CTO of Amazon
  15. 15. AVOID SINGLE POINTS OF FAILURE
  16. 16. AVOID SINGLE POINTS OF FAILURE: ASSUME EVERYTHING FAILS, AND WORK BACKWARDS
  17. 17. YOUR GOAL: Applications should continue to function
  18. 18. AMAZON EBS ELASTIC BLOCK STORE
  19. 19. AMAZON ELB ELASTIC LOAD BALANCING
  20. 20. HEALTH CHECKS
  21. 21. #2 MULTIPLE AVAILABILITY ZONES ●●○○○
  22. 22. AMAZON RDS MULTI-AZ
  23. 23. AMAZON ELB AND MULTIPLE AZs
  24. 24. #3 SCALING ●●●○○
  25. 25. AUTO SCALING: SCALE UP/DOWN EC2 CAPACITY
  26. 26. #4 SELF-HEALING ●●●●○
  27. 27. HEALTH CHECKS + AUTO SCALING
  28. 28. HEALTH CHECKS + AUTO SCALING = SELF-HEALING
  29. 29. DEGRADED MODE
  30. 30. AMAZON S3 STATIC WEBSITE + AMAZON ROUTE 53 WEIGHTED RESOLUTION
  31. 31. #5 LOOSE COUPLING ●●●●●
  32. 32. BUILD LOOSELY COUPLED SYSTEMS
          The more loosely they are coupled, the bigger they scale and the more fault tolerant they get…
  33. 33. AMAZON SQS SIMPLE QUEUE SERVICE
  34. 34. PUBLISH & RECEIVE, TRANSCODE, NOTIFY
  36. 36. VISIBILITY TIMEOUT
  37. 37. BUFFERING
  38. 38. CLOUDWATCH METRICS FOR AMAZON SQS + AUTO SCALING
  39. 39. 1. DESIGN FOR FAILURE
          2. MULTIPLE AVAILABILITY ZONES
          3. SCALING
          4. SELF-HEALING
          5. LOOSE COUPLING
  45. 45. YOUR GOAL: Applications should continue to function
  46. 46. IT’S ALL ABOUT CHOICE: BALANCE COST & HIGH AVAILABILITY
  47. 47. Summary
          • Leverage AWS Services
          • Apply 5 principles for HA
          • Automate
          • Test your HA implementation
  48. 48. aws.amazon.com/architecture
  49. 49. JUST EAT WITH AWS: HIGH AVAILABILITY
  50. 50. JUST EAT
          • 13 countries
          • 34,000+ restaurants
          • 8m+ members
          • Over 50m orders
          • 16,000+ restaurants in UK, 8m visits a month
  51. 51. PLATFORM
          [Diagram layers: Devices in restaurants; Apps and External Services; Consumer Website, Public API, Customer Care Tools, Restaurant Services; APIs: Order API, Ratings API, Search API, …; Common Infrastructure: SQL Server, Networking, Monitoring, Emails]
  52. 52. DESIGN FOR FAILURE
          [Diagram labels: Devices in restaurants; Web Service and Device Service instances in Auto scaling Groups across eu-west-1a, eu-west-1b, eu-west-1c; JCT Service; Orders queue; Orders data]
  53. 53. SCALING – PROACTIVE
  54. 54. SCALING – PROACTIVE: Web servers in data center
  55. 55. SCALING – PROACTIVE: Web servers in data center, Web EC2 instances
  56. 56. SCALING – REACTIVE: Web servers in data center, Web EC2 instances
  57. 57. EVERYTHING MULTI AZ – CONSUMER WEBSITE
          Monitor to keep resource usage at max of 66% of capacity in each AZ when everything’s available.
          [Diagram: Auto scaling Group across eu-west-1a, eu-west-1b, eu-west-1c; utilisation labels 66%, 66%, 66% and 99%, 99%]
  58. 58. EVERYTHING MULTI AZ – INTERNAL APIS
          Applications assume that internal APIs will fail or run slowly, so they can cope with the loss of an AZ or instances and will just degrade gracefully.
          Alarms tell us that performance has been degraded, but the platform will self-heal as new instances are launched.
          [Diagram: Auto scaling Group across eu-west-1a, eu-west-1b, eu-west-1c; utilisation labels 80%, 80%, 80% and 100%, 100%]
  59. 59. EVERYTHING MULTI AZ – SQL SERVER 2012
          Connection strings simply contain both primary and secondary servers – no code changes required.
          Alarms tell us that failover has occurred, but it happens without manual intervention.
          [Diagram labels: Primary, Witness, Secondary; eu-west-1a, eu-west-1b, eu-west-1c]
  60. 60. DANIEL RICHARDSON, DIRECTOR OF ENGINEERING, JUST EAT
          daniel.richardson@just-eat.com
          www.just-eat.com/jobs
          twitter.com/JustEatUK
          www.facebook.com/justeat
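
The 66% figure on slide 57 follows from simple headroom arithmetic: if the load of a lost AZ is spread evenly over the survivors, three AZs running at 66% put the remaining two at 99%. Below is a minimal Python sketch of that calculation. It is not code from the talk: the function names are hypothetical, and even redistribution of load across the surviving AZs is an assumption.

```python
def utilisation_after_az_loss(per_az_utilisation, total_azs, lost_azs=1):
    """Per-AZ utilisation (percent) after `lost_azs` zones fail.

    Assumes every AZ carried the same load beforehand and that the load
    of the failed zones is spread evenly over the survivors.
    """
    if total_azs <= 0:
        raise ValueError("total_azs must be positive")
    if not 0 <= lost_azs < total_azs:
        raise ValueError("at least one AZ must survive")
    total_load = per_az_utilisation * total_azs
    return total_load / (total_azs - lost_azs)


def max_safe_utilisation(total_azs, lost_azs=1):
    """Highest steady-state utilisation (percent) per AZ that still fits
    within 100% of the surviving capacity after `lost_azs` zones fail."""
    if total_azs <= 0:
        raise ValueError("total_azs must be positive")
    if not 0 <= lost_azs < total_azs:
        raise ValueError("at least one AZ must survive")
    return 100.0 * (total_azs - lost_azs) / total_azs


if __name__ == "__main__":
    # Slide 57: three AZs at 66% each, one AZ lost.
    print(utilisation_after_az_loss(66, 3))  # 99.0
    # Slide 58: three AZs at 80% each, one AZ lost.
    print(utilisation_after_az_loss(80, 3))  # 120.0
    # Ceiling for surviving one AZ loss out of three (about 66.7).
    print(max_safe_utilisation(3))
```

With three AZs at 80%, the same calculation gives 120%, which is more than the survivors can serve. That fits slide 58's description of internal APIs that degrade gracefully until Auto Scaling launches new instances.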
