Advanced Topics - Session 4 - Architecting for High Availability

Architecting for High Availability
Ianni Vamvadelis, Solution Architect

What is High Availability (HA)?

• Percentage of time an application operates
• Loss of availability is known as an outage or downtime
– Planned and unplanned
– App is offline, unreachable, or partially available
– App is unresponsive
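
As a worked illustration of "percentage of time an application operates", the sketch below turns an availability target into a yearly downtime budget. The function names and the 8,760-hour year are assumptions made for the example, not part of the original slides.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def availability_percent(uptime_hours: float, total_hours: float) -> float:
    """Percentage of the period during which the application was operating."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return 100.0 * uptime_hours / total_hours


def allowed_downtime_hours(target_percent: float,
                           total_hours: float = HOURS_PER_YEAR) -> float:
    """Downtime budget (planned plus unplanned) implied by an availability target."""
    return total_hours * (1.0 - target_percent / 100.0)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        hours = allowed_downtime_hours(target)
        print(f"{target}% availability allows {hours:.2f} hours of downtime per year")
```

Each extra "nine" cuts the downtime budget by a factor of ten, which is why the later slides frame HA as a balance against cost.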

HA is related to …
• Scalability
– Often slow is indistinguishable from unavailable.
• Fault Tolerance
– Apps continue functioning when components fail
• Disaster Recovery
– Restoring service after a catastrophic event

HA and DR: High Availability and Disaster Recovery

• A continuum
• Business continuity plan
• Not an all-or-nothing proposition

In the face of internal or external events, how do you…
– Keep your applications running 24x7
– Make sure your data is safe
– Get an application recovered after a major disaster

How does AWS help with High Availability?

US-WEST (Oregon)
EU-WEST (Ireland)

AWS GovCloud (US)
ASIA PAC (Tokyo)

US-EAST (Virginia)

ASIA PAC (Sydney)
US-WEST (N. California)

ASIA PAC (Singapore)

SOUTH AMERICA (Sao Paulo)

AWS SERVICES

Inherently Highly Available and Fault Tolerant Services
• Amazon S3
• Amazon DynamoDB
• Amazon CloudFront
• Amazon Route 53
• Elastic Load Balancing
• Amazon SQS
• Amazon SNS
• Amazon SES
• Amazon SWF
• …

Highly Available with the right architecture
• Amazon EC2
• Amazon EBS
• Amazon RDS
• Amazon VPC

1. DESIGN FOR FAILURE
2. MULTIPLE AVAILABILITY ZONES
3. SCALING
4. SELF-HEALING
5. LOOSE COUPLING

LET’S BUILD A
HIGHLY AVAILABLE
SYSTEM

#1
DESIGN FOR FAILURE
●○○○○

« Everything fails
all the time »
Werner Vogels
CTO of Amazon

AVOID SINGLE POINTS OF FAILURE

ASSUME EVERYTHING FAILS,
AND WORK BACKWARDS

YOUR GOAL
Applications should continue to function

AMAZON EBS
ELASTIC BLOCK STORE

AMAZON ELB
ELASTIC LOAD BALANCING

#2
MULTIPLE
AVAILABILITY ZONES
●●○○○
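
A small sketch of what spreading capacity over multiple Availability Zones buys you. The zone names are borrowed from the Just Eat slides later in this deck; the function names and the round-robin placement are illustrative assumptions, not how any AWS service is implemented.

```python
from collections import Counter
from itertools import cycle
from typing import List, Sequence


def spread_across_zones(instance_count: int, zones: Sequence[str]) -> List[str]:
    """Assign instances to zones round-robin so no zone holds more than its share."""
    if not zones:
        raise ValueError("at least one zone is required")
    zone_cycle = cycle(zones)
    return [next(zone_cycle) for _ in range(instance_count)]


def surviving_capacity(placement: Sequence[str], failed_zone: str) -> int:
    """Number of instances still running if one whole zone becomes unavailable."""
    return sum(1 for zone in placement if zone != failed_zone)


if __name__ == "__main__":
    zones = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
    placement = spread_across_zones(6, zones)
    print(Counter(placement))
    print("instances left after losing eu-west-1a:",
          surviving_capacity(placement, "eu-west-1a"))
```

With a single zone, the same failure would leave zero instances; with three zones, two thirds of the fleet keeps serving.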

#3
SCALING
●●●○○

AUTO SCALING
SCALE UP/DOWN EC2 CAPACITY
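
To make "scale up/down EC2 capacity" concrete, here is a minimal sketch of a step-style scaling decision. The thresholds, group sizes and function name are assumptions chosen for the example; a real Auto Scaling group is configured through policies and alarms rather than code like this.

```python
def desired_capacity(current: int, cpu_percent: float, *,
                     scale_out_above: float = 70.0,
                     scale_in_below: float = 30.0,
                     min_size: int = 2,
                     max_size: int = 10) -> int:
    """Add one instance when CPU is high, remove one when it is low,
    and always stay within the group's minimum and maximum size."""
    if cpu_percent > scale_out_above:
        target = current + 1
    elif cpu_percent < scale_in_below:
        target = current - 1
    else:
        target = current
    return max(min_size, min(max_size, target))


if __name__ == "__main__":
    for cpu in (10.0, 50.0, 85.0):
        print(f"cpu={cpu}% -> desired capacity {desired_capacity(4, cpu)}")
```

Keeping `min_size` at two or more is one way to avoid a single point of failure even when load is low.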

#4
SELF-HEALING
●●●●○

HEALTH CHECKS
+
AUTO SCALING
=
SELF-HEALING
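
The equation above can be sketched as a reconciliation loop: a health check marks instances unhealthy, and the scaling logic replaces them until the desired count is restored. `Instance`, `reconcile` and `make_launcher` are hypothetical names for this illustration and do not correspond to any AWS API.

```python
from dataclasses import dataclass
from itertools import count
from typing import Callable, List


@dataclass
class Instance:
    instance_id: str
    healthy: bool = True  # result of the most recent health check


def make_launcher(prefix: str = "i-") -> Callable[[], Instance]:
    """Return a function that 'launches' a fresh, healthy instance each time."""
    ids = count(1)
    return lambda: Instance(f"{prefix}{next(ids)}")


def reconcile(fleet: List[Instance], desired: int,
              launch: Callable[[], Instance]) -> List[Instance]:
    """Drop instances that failed their health check, then launch
    replacements until the fleet is back at the desired size."""
    survivors = [inst for inst in fleet if inst.healthy]
    while len(survivors) < desired:
        survivors.append(launch())
    return survivors


if __name__ == "__main__":
    fleet = [Instance("a"), Instance("b", healthy=False), Instance("c")]
    healed = reconcile(fleet, desired=3, launch=make_launcher("new-"))
    print([inst.instance_id for inst in healed])
```

No operator is involved in the loop, which is the point of the slide: failure detection plus automatic replacement gives self-healing.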

AMAZON S3
STATIC WEBSITE
+
AMAZON ROUTE 53
WEIGHTED RESOLUTION
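
With weighted resolution, each record receives a share of DNS answers equal to its weight divided by the sum of all weights. The sketch below shows that arithmetic and simulates one lookup; the endpoint names and the 90/10 split are assumptions for illustration, and the code does not call Route 53.

```python
import random
from typing import Dict


def traffic_share(weights: Dict[str, int]) -> Dict[str, float]:
    """Expected fraction of lookups answered with each record."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {name: weight / total for name, weight in weights.items()}


def pick_endpoint(weights: Dict[str, int], rng: random.Random) -> str:
    """Simulate a single weighted DNS answer."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


if __name__ == "__main__":
    weights = {"dynamic-site": 90, "s3-static-site": 10}  # hypothetical records
    print(traffic_share(weights))
    print(pick_endpoint(weights, random.Random()))
```

Setting one weight to zero takes that endpoint out of rotation, so shifting weights toward a static site on Amazon S3 is one possible way to keep serving something when the dynamic site is in trouble.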

#5
LOOSE
COUPLING
●●●●●

BUILD LOOSELY
COUPLED SYSTEMS
The more loosely they are coupled,
the bigger they scale,
the more fault tolerant they get…

AMAZON SQS
SIMPLE QUEUE SERVICE

RECEIVE → TRANSCODE → PUBLISH & NOTIFY
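
A minimal sketch of a receive, transcode, publish pipeline whose stages talk only through queues. Python's in-process `queue.Queue` stands in for Amazon SQS here purely for illustration; the function name, the `.mp4` suffix and the single worker per stage are assumptions.

```python
import queue
import threading
from typing import List, Sequence


def run_pipeline(uploads: Sequence[str]) -> List[str]:
    """Push uploads through transcode and publish stages that only share queues."""
    received: queue.Queue = queue.Queue()
    transcoded: queue.Queue = queue.Queue()
    published: List[str] = []
    stop = object()  # sentinel that tells a stage to shut down

    def transcode_worker() -> None:
        while True:
            item = received.get()
            if item is stop:
                transcoded.put(stop)  # pass the shutdown signal downstream
                break
            transcoded.put(f"{item}.mp4")  # stand-in for real transcoding

    def publish_worker() -> None:
        while True:
            item = transcoded.get()
            if item is stop:
                break
            published.append(item)  # stand-in for publish and notify

    workers = [threading.Thread(target=transcode_worker),
               threading.Thread(target=publish_worker)]
    for worker in workers:
        worker.start()
    for name in uploads:
        received.put(name)
    received.put(stop)
    for worker in workers:
        worker.join()
    return published


if __name__ == "__main__":
    print(run_pipeline(["holiday", "birthday"]))
```

Because each stage sees only its queue, a slow or restarted transcoder delays messages rather than breaking the receiver, and more workers can be added per stage without changing the others.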

CLOUDWATCH METRICS
FOR AMAZON SQS
+
AUTO SCALING
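
One way to combine queue metrics with Auto Scaling is to size the worker fleet from the backlog, for example from the SQS `ApproximateNumberOfMessagesVisible` metric in CloudWatch. The sketch below shows only the sizing arithmetic; the per-worker throughput and the fleet limits are assumed values.

```python
import math


def workers_for_backlog(visible_messages: int, messages_per_worker: int,
                        min_workers: int = 1, max_workers: int = 20) -> int:
    """Number of workers needed to drain the backlog, clamped to fleet limits."""
    if messages_per_worker <= 0:
        raise ValueError("messages_per_worker must be positive")
    needed = math.ceil(visible_messages / messages_per_worker)
    return max(min_workers, min(max_workers, needed))


if __name__ == "__main__":
    for backlog in (0, 250, 10_000):
        print(f"backlog={backlog} -> {workers_for_backlog(backlog, 100)} workers")
```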

IT’S ALL ABOUT

CHOICE
BALANCE COST & HIGH AVAILABILITY

Summary
Leverage AWS Services

Apply 5 principles for HA

Automate

Test your HA implementation

aws.amazon.com/architecture

JUST EAT WITH AWS
HIGH AVAILABILITY

JUST EAT
• 13 countries
• 34,000+ restaurants
• 8m+ members
• Over 50m orders
• 16,000+ restaurants in UK, 8m visits a month

PLATFORM

Devices in restaurants

Apps and External Services
• Consumer Website
• Public API
• Customer Care Tools
• Restaurant Services

APIs
• Order API
• Ratings API
• Search API
• …

Common Infrastructure
• SQL Server
• Networking
• Monitoring
• Emails

DESIGN FOR FAILURE
[Diagram: two Auto scaling Groups, each spanning eu-west-1a, eu-west-1b and eu-west-1c. One runs Web Service instances; the other runs Device Service instances for the devices in restaurants. Also shown: Orders queue, JCT Service, Orders data.]

SCALING - PROACTIVE

Web servers in data center

SCALING – PROACTIVE


Web EC2 instances

SCALING – REACTIVE


Web EC2 instances
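
The difference between the two scaling styles can be sketched as follows: proactive scaling sets capacity from a schedule of expected peaks, while reactive scaling adjusts capacity from a measured metric. The hours, instance counts and 60% CPU target are invented for the example and are not Just Eat's actual settings.

```python
import math
from typing import Dict


def proactive_capacity(hour: int, schedule: Dict[int, int], default: int) -> int:
    """Capacity planned ahead of time for a given hour of the day."""
    return schedule.get(hour, default)


def reactive_capacity(current: int, cpu_percent: float,
                      target_cpu: float = 60.0) -> int:
    """Capacity needed to bring average CPU back to the target level."""
    return max(1, math.ceil(current * cpu_percent / target_cpu))


def combined_capacity(hour: int, schedule: Dict[int, int], default: int,
                      current: int, cpu_percent: float) -> int:
    """Take whichever of the planned and measured requirements is larger."""
    return max(proactive_capacity(hour, schedule, default),
               reactive_capacity(current, cpu_percent))


if __name__ == "__main__":
    evening_peak = {18: 12, 19: 12, 20: 10}  # hypothetical schedule
    print(combined_capacity(18, evening_peak, 4, current=4, cpu_percent=50.0))
    print(combined_capacity(3, evening_peak, 4, current=4, cpu_percent=90.0))
```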

EVERYTHING MULTI AZ – CONSUMER WEBSITE

Auto scaling Group across eu-west-1a, eu-west-1b and eu-west-1c

Monitor to keep resource usage at a max of 66% of capacity in each AZ when everything’s available.

• All three AZs available: 66% / 66% / 66%
• One AZ lost: 99% / 99% on the two that remain
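
The 66% and 99% figures on this slide follow from simple arithmetic: when one of three equally loaded AZs is lost, its load is shared by the other two. A sketch of that calculation, with function names chosen for illustration and the assumption that load is redistributed evenly:

```python
def utilisation_after_az_loss(zones: int, per_zone_utilisation: float,
                              zones_lost: int = 1) -> float:
    """Utilisation of each surviving AZ once the lost AZs' load is redistributed."""
    remaining = zones - zones_lost
    if remaining <= 0:
        raise ValueError("at least one zone must survive")
    return per_zone_utilisation * zones / remaining


def max_safe_utilisation(zones: int, zones_lost: int = 1) -> float:
    """Highest per-AZ utilisation that still fits after losing zones_lost AZs."""
    if zones <= zones_lost:
        raise ValueError("at least one zone must survive")
    return (zones - zones_lost) / zones


if __name__ == "__main__":
    print(f"{utilisation_after_az_loss(3, 0.66):.0%} per AZ after losing one of three")
    print(f"{max_safe_utilisation(3):.1%} is the ceiling for three AZs")
```

The same function shows why a target of 80% per AZ cannot absorb the loss of one AZ out of three: the survivors would need 120% of their capacity.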

EVERYTHING MULTI AZ – INTERNAL APIS
Applications assume that internal APIs will fail or run slowly, so they can cope with the loss of an AZ or of instances and will just degrade gracefully.

Auto scaling Group

• All three AZs available: 80% / 80% / 80%
• One AZ lost: 100% / 100% on the two that remain

Alarms tell us that performance has been degraded, but the platform will self-heal as new instances are launched.
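
One way a caller can "assume internal APIs will fail or run slowly" is to wrap each call in a timeout and fall back to a default. This is a sketch of that pattern only; `call_with_fallback` and `slow_ratings_lookup` are hypothetical names, and nothing here is taken from Just Eat's code.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_fallback(func: Callable[[], T], fallback: T,
                       timeout_seconds: float) -> T:
    """Return func() if it succeeds within the timeout, otherwise the fallback."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(func)
        try:
            return future.result(timeout=timeout_seconds)
        except Exception:
            # Covers both a timeout and an error raised by the call itself.
            return fallback
    finally:
        # Do not block the caller on a call that is still running.
        pool.shutdown(wait=False)


def slow_ratings_lookup() -> int:
    """Stand-in for an internal API that has become slow."""
    time.sleep(0.3)
    return 5


if __name__ == "__main__":
    # Falls back to "no rating available" instead of hanging the page.
    print(call_with_fallback(slow_ratings_lookup, fallback=0, timeout_seconds=0.05))
```

The page renders with a degraded value while alarms and Auto Scaling deal with the underlying capacity problem.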

EVERYTHING MULTI AZ – SQL SERVER 2012
Connection strings simply contain both primary and secondary servers, so no code changes are required.

[Diagram: Primary, Secondary, Witness]

Alarms tell us that failover has occurred, but it happens without manual intervention.
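
A connection string that names both servers might look like the one built below. This is an assumption about the setup: it uses the ADO.NET `Failover Partner` keyword, which applies to SQL Server database mirroring, and the server and database names are placeholders. The slides do not show the actual string.

```python
def mirrored_connection_string(primary: str, secondary: str, database: str) -> str:
    """Build an ADO.NET-style connection string naming a failover partner."""
    parts = {
        "Data Source": primary,          # server tried first
        "Failover Partner": secondary,   # server tried if the primary is unavailable
        "Initial Catalog": database,
        "Integrated Security": "True",
    }
    return ";".join(f"{key}={value}" for key, value in parts.items())


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    print(mirrored_connection_string("sql-primary", "sql-secondary", "Orders"))
```

Because the client library handles the retry against the partner, application code stays unchanged when a failover happens.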

DANIEL RICHARDSON
DIRECTOR OF ENGINEERING, JUST EAT
daniel.richardson@just-eat.com

www.just-eat.com/jobs
twitter.com/JustEatUK
www.facebook.com/justeat
