More Related Content Similar to ARC319_Multi-Region Active-Active Architecture (20) More from Amazon Web Services (20) ARC319_Multi-Region Active-Active Architecture1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Multi-Region Active-Active
Architecture
G l e n n G o r e , D i r e c t o r , S o l u t i o n s A r c h i t e c t u r e
G i r i s h P a t i l , S e n i o r S o l u t i o n s A r c h i t e c t
D a r i n B r i s k m a n , D e v e l o p e r E v a n g e l i s t
ARC319
2. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Session objectives
3. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Session objectives
1. Understand how to think about application availability
2. Understand how AWS makes services highly-available
3. Understand why you might need Multi-Region Active-Active
architecture
4. Review AWS services that are useful for Multi-Region Active-
Active design
5. Trade-offs involved in the design
4. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Understand how to think about
application availability
5. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Where are we?
6. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Calculating availability with hard
dependencies
Application
Dependency 1 Dependency 3
99% 90%
<= 90%
Dependency 2
95%
7. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
“A chain is only as strong as its
weakest link.”
8. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Calculating availability with redundant
components
Application
Component 1 Component 1 Replica
90% 90%
~99%
Assuming
instantaneous
failover!
Let us understand
intuitively
9. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
“Component redundancy
increases availability
significantly!”
10. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
“Nines” of availability
Reliability Pillar White-paper (Nov 2017)
Availability Max Disruption (per year) Application Categories
99% 3 days 15 hours Batch processing, data extraction, transfer, and load jobs
99.90% 8 hours 45 minutes Internal tools like knowledge management, project tracking
99.95% 4 hours 22 minutes Online commerce, point of sale
99.99% 52 minutes Video delivery, ridesharing services, broadcast systems
99.999% 5 minutes ATM transactions, telecommunications systems
11. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Your application’s availability
depends …
… n o t only on availab ility of AWS se rvice s, b u t also on
the qu ality of you r software and you r adhe re nce to
op e rational b e st p ractice s !
12. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
“Nines” of availability
Reliability Pillar whitepaper (Nov 2017)
Availability Max Disruption (per year) Application Categories
99% 3 days 15 hours Batch processing, data extraction, transfer, and load jobs
99.90% 8 hours 45 minutes Internal tools like knowledge management, project tracking
99.95% 4 hours 22 minutes Online commerce, point of sale
99.99% 52 minutes Video delivery, ridesharing services, broadcast systems
99.999% 5 minutes ATM transactions, telecommunications systems
Consider Multi-Region
Active-Active for this class
13. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Understand how AWS makes services
highly available
14. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
15. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
16. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Region-wide AWS services
Amazon
DynamoDB
Amazon
RDS
Amazon
ElastiCache
Amazon
S3
Amazon
EFS
Amazon
SQS
Amazon
Kinesis
Amazon ES
Default
Configurable for
multi-AZ deployment
17. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
High availability for EC2 – instance
recovery
Instance ID,
private IP addresses,
Elastic IP addresses, and
all instance metadata
Instance ID,
private IP addresses,
Elastic IP addresses, and
all instance metadata
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-recover.html
Loss of network connectivity
Loss of system power
Software issues on the physical host
Hardware issues on the physical host that impact network
reachability
New instance, same in all
manners as old instance
18. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
High availability for EC2 – Auto Scaling
Availability Zone 1 Availability Zone 2
Auto-scaling Group
Fresh server from AMI
19. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Understand why you might need
Multi-Region Active-Active
architecture
20. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Serving geographically distributed
customer base
ChinaUSA West USA East EU-East
Users from
San Francisco
Users from
New York
Users from
London
Users from
Shanghai
21. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Guarding against failure of your applications
in one region
Applications
in US West
Applications
in US East
Users from
San Francisco
Users from
New York
AWS Service 1
AWS Service 2
AWS Service 3
AWS Service 4
AWS Service 1
AWS Service 2
AWS Service 3
AWS Service 4
22. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Cost effective DR: Why not use DR all the
time?
DR environments that don’t get used
1. Fall out of sync, eventually
2. Waste money
23. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Key Technology Requirements
24. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Reliable & secure network
Network backbone
VPN or encrypted (SSL) communication over public IP
Region BRegion A Encrypted data
Backbone
25. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data replication
1. Synchronous
2. Asynchronous
o Nearly continuous (lag of few seconds or minutes)
o Batch (hourly/daily/weekly copy)
26. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Code synchronization
1. Staged deployment across regions
o Rolling
o Blue-Green
o Rollback facility
2. Parameterized localization
27. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Traffic segregation & management
Segregation options
Explicit – different URLs
• e.g. east.abc-corp.com and west.abc-corp.com
Implicit (DNS level) – the same URL
• e.g. www.abc-corp.com
Traffic management infrastructure
• Throttling
• Internal redirecting
• External redirecting
When users reach wrong
regions
28. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Monitoring
Application & infrastructure health
Replication lag & code sync monitoring
29. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Multi-tenancy
What is a tenant?
-A unit of movement/failover
US EastUS West
Backbone
California
Texas New York
Pennsylvania
Texas
30. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Direction of replication – before
Texas
containers DB DB
Region A Region B
31. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Direction of replication – after
Texas
containersDB DB
Region A Region B
32. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Failover scripts
Failover scripts should be able to:
1. Traffic rerouting for a tenant
2. Change direction of replication
33. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Key architectural considerations
34. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Tolerance for network partitioning
Failure of one region should not lead to failure of applications in
another
Regional independence for request serving – no API calls from one
region to another
Region BRegion A
Backbone
35. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Minimal data replication requirements
Does all data need to be replicated?
If yes, does it need to replicated synchronously?
Does all data need to be replicated continuously?
36. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Classification of data
Transactions
Catalog
information
Events,
objects
Server logs
Purchase record Product details Click stream Https logs
Low volume,
but highly
critical
High volume,
but less
critical
37. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Sync vs. async replication modes
Tenant
1
App
DB DB
Tenant
1
App
DB DB
Write
Write
Replicate
Replicate
AckAck
Ack Ack
Cons: Network & target
dependent!
Pros: Guarantees
consistency
Cons: Two databases can go
out of sync
Pros: Network & target
independent
Region A Region B
Async preferred,
when possible!
38. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Concept of data replication lanes
Synchronous replication
Asynchronous nearly
continuous replication
Asynchronous batch
replication
Transactions
Catalog
information
Most difficult to manage
Easiest to manage
Events
39. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Ideal replication system
Each data store type will need a different technology.
But at the the minimum:
1. It should report replication lag
2. It should report record offset
3. Should be able to retry replication of failed records
ReplicatorSource DB Target DB
High Level
Metrics
monitoring
Try until successful
Replication
lag
Record offset
40. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Distributed system design best practices
Eventual
consistencyIdempotency
Static
stability
ThrottlingExponential
backup
Circuit
breaking
Reliability Pillar whitepaper (Nov 2017)
41. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
How AWS enables customers to
deploy Multi-Region Active-Active
Architecture
42. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS worldwide network backbone
43. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Multi-Region VPN – over non-AWS network
https://aws.amazon.com/answers/networking/aws-multiple-region-multi-vpc-connectivity/
44. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Multi-Region VPN – over AWS network
https://aws.amazon.com/answers/networking/aws-multiple-region-multi-vpc-connectivity/
45. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
INTER REGION VPC PEERING
N E W
46. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
HQ
OREGON
VPC
Shared
services
VPC
VPC
Customer
network
OREGON REGION
VPC
VPC
VPC VPC
VPC
IRELAND REGION
VPC
Shared
services
VPC
VPC
VPC
VPC
VPC
N O O V E R L A P P I N G I P
A D D R E S S S P A C E
SHARED SERVICES
A M A Z O N B A C K B O N E
V P C P E E R
C R O S S R E G I O N P E E R E D
C O N N E C T I O N E N C R Y P T E D
47. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Key benefits
1. Works similar to existing Intra Region VPC Peering
2. Data always stays on the AWS backbone
3. Data always encrypted by default
4. No need to use Gateways, third-party VPN solutions to connect across
regions.
5. No additional charges for using interregion VPC peering. Customers pay
standard data transfer rates
48. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
49. Amazon RDS Multi-AZ deployment
Availability Zone 1 Availability Zone 2
Security group
mydb1.abc45345.eu-west-1.rds.amazonaws.com:3306
VPC subnetVPC subnet
Synchronous
physical replication
• Standbys ensure zero data loss in event of the master’s failure.
• Always have a stand-by. Always!!!
• Also note, cross AZ failovers are automatic & fast; whereas cross-
region failovers take time and need nontrivial planning.
50. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Supported DBs
• MySQL
• MariaDB
• PostgreSQL
• Aurora
http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_ReadRepl.html#USER_ReadRepl.XRgn
http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/AuroraMySQL.Replication.CrossRegion.html
51. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Monitoring RDS replication lag
52. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Very simple plan
All regions send
critical write
traffic to a single
master
Replication traffic
53. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
“Simple, but higher latency and
network dependency”
54. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Better plan (for most cases)
Tenant in
Region A
Tenant in
Region B
Tenant in
Region C
Sync
Async
Async
55. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Applications get faster response
Some committed transactions may not make it to the failover region, before the switch.
Committed transactions are safe due to standby.
They can be recovered with help of reconciliation techniques.
However
56. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Zero application downtime from ANY instance failure
Zero application downtime from ANY AZ failure
Faster write performance and higher scale
Sign up for single-region multi-master preview today;
Multi-Region Multi-Master coming in 2018
Availability
Zone 1
Scale out both reads and writes
Availability
Zone 2
Availability
Zone 3
Application
Read/Write
Master 1
Shared distributed storage volume
Read/Write
Master 2
Read/Write
Master 3
Aurora multi-master—scale out reads & writes
F i r s t M y S Q L c o m p a t i b l e D B s e r v i c e w i t h s c a l e o u t a c r o s s m u l t i p l e d a t a c e n t e r s
57. Database Migration Service (DMS) for
granular replication
DMS
Replication instance
Source Target
Update
t1 t2
t1
t2
Transactions Change
apply
after bulk
load
Change data capture
(supported for MySQL, MariaDB,
Aurora, PostgreSQL)
Details: http://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.html
58. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon DynamoDB
F a s t a n d f l e x i b l e N o S Q L d a t a b a s e s e r v i c e f o r a n y s c a l e
Fast, consistent
performance
Highly scalable Fully managed Business critical
reliability
Consistent single-digit millisecond
latency; DAX in-memory
performance reduces response
times to microseconds
Auto-scaling to hundreds
of terabytes of data that
serve millions of
requests per second
Automatic provisioning,
infrastructure
management, scaling,
and configuration with
zero downtime
Data is replicated across
fault tolerant Availability
Zones, with fine-grained
access control
59. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon DynamoDB Global Tables ( G A )
Fi r s t f u l l y m a n a g e d , m u l t i - m a s t e r , m u l t i - r e g i o n d a t a b a s e
Build high performance, globally distributed applications
Low latency reads & writes to locally available tables
Disaster proof with multi-region redundancy
Easy to set up and no application rewrites required
Globally dispersed users
Replica (N. America)
Replica (Europe)
Replica (Asia)
Global App
Global Table
61. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
What is AWS Lambda
• It is a serverless, event driven compute service
• Define functions, and the service executes them as many times as input
arrives
• Many supported event sources including: Kinesis and DynamoDB streams
• Scales automatically with event rate
• Supported languages: Java, .Net, Node.js, Python
Ensures streamed records in shards are
processed (replicated in this case) in order!
63. Error handling in Lambda
1. For stream-based event sources (Amazon Kinesis Streams and
DynamoDB streams), AWS Lambda polls your stream and invokes your
Lambda function.
2. Therefore, if a Lambda function fails, AWS Lambda attempts to
process the erring batch of records until the time the data expires from
the stream.
3. The exception is treated as blocking, and AWS Lambda will not read
any new records from the stream until the failed batch of records
either expires or processed successfully.
This ensures that AWS Lambda processes the stream events in order.
http://docs.aws.amazon.com/lambda/latest/dg/retries-on-errors.html
64. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
ElasticSearch cross region replication
Amazon Kinesis
Firehose
Amazon Kinesis
Firehose
Amazon ES
Amazon ES
Source
Source needs to have tracking
to have successful posting to
both regions
Region A
Region B
65. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Redshift cross region replication
Amazon RedshiftAmazon Kinesis
Firehose
Amazon Kinesis
Firehose
Amazon Redshift
Source
Source needs to have tracking
to have successful posting to
both regions
Region A
Region B
66. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Global Traffic Management with
Route 53
67. Global service
Load balancers and/or servers across multiple regions
globally
Region A
Region B
Region C
User for Region A
User for Region C
User for Region B
68. Traffic flow: endpoints
• Hybrid/low level infrastructure: IP address or CNAME
• ELB Classic Load Balancer / Application Load Balancer
• Amazon S3 website
• Amazon CloudFront distribution
• AWS Elastic Beanstalk environment
69. Traffic flow: Quick overview of rules
Simple routing policy – Use for a single resource that performs a given
function for your domain, for example, a web server that serves content for
the example.com website.
Failover routing policy – Use when you want to configure active-passive
failover.
Geolocation routing policy – Use when you want to route traffic based on
the location of your users.
70. Traffic flow: Quick overview of rules
Geoproximity routing policy – Use when you want to route traffic based on
the location of your resources and, optionally, shift traffic from one resource
in one location to resources in another.
Latency routing policy – Use when you have resources in multiple locations
and you want to route traffic to the resource that provides the best latency.
Multivalue answer routing policy – Use when you want Amazon Route 53 to
respond to DNS queries with up to eight healthy records selected at random.
Weighted routing policy – Use to route traffic to multiple resources in
proportions that you specify.
72. Example of an advanced policy
Failover Rule Geolocation Rule
Failover Rule
End points
73. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Cross-region code deployment
74. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
BLUE-GREEN
Link
75. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
CANARY
Link
76. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Is code deployment any different?
No. DevOps pipelines work the same way.
Important trade-off you should make:
1. Simultaneous deployments
2. “One region at a time” deployments
77. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Using AWS services
https://aws.amazon.com/blogs/devops/building-a-cross-regioncross-account-code-deployment-solution-on-aws/
78. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Parameter localization
App 1 App 2 App 1 App 2
Local parameters DB Local parameters DB
Region A Region B
Central parameter
configuration application
DevOps pipeline deploying
same code base
Take a look at Netflix
Archaius!
79. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Cross Region Monitoring
80. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Is monitoring any different?
Yes. You need to do additional monitoring.
1. Replication lags (very critical)
2. Record offset tracking
Need to put important stats for two/more regions on a
single canvas for proper monitoring.
81. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Monitoring replication lag – CloudWatch
82. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Monitoring Amazon S3 file replication progress
https://aws.amazon.com/answers/infrastructure-management/crr-monitor/
Info on new
files
Info on
replicated files
83. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Monitoring using serverless components
Arrow direction indicates general direction of data flow
EC2 instances
VPC
Flow Logs
CloudTrail
Audit Logs
S3
Access
Logs
ELB
Access
Logs
CloudFront
Access
Logs
SNS
Notifications
DynamoDB
Streams
SES
Inbound
Email
Cognito
Events
Kinesis
Streams
CloudWatch
Events &
Alarms
Config
Rules
AmazonKinesis
Firehose
AmazonKinesis
Firehose
Kinesis
Agent
https://d0.awsstatic.com/whitepapers/whitepaper-use-amazon-elasticsearch-to-log-and-monitor-almost-everything.pdf
Highly available
deployment across
two AZs
84. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Not all metrics are created equal!
High-Level Metric (high importance, low volume):
• User experience
• User count
• Transaction count
• Replication status
Low-Level Metric (relatively low importance, high volume):
• HTTP request count
• Read vs. write throughout
• Cache hit vs. miss
85. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
High-Level Metrics monitoring
Low-Level Metrics
monitoring
Low-Level Metrics
monitoring
High-Level
Metrics
monitoring
High-Level Metrics
monitoring
App monitoring agents App monitoring agents
Region A Region B
Replicate only
High-Level Metrics.
Use region tags.
86. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Some useful open source projects
87. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Zuul & Isthmus from Netflix
Request rerouting for users coming from
unwanted region, handling load balancer failures
88. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
EVCache from Netflix
For remote cache invalidation and synchronization
89. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Conclusions
90. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Conclusions
o Avoid synchronous replication & simultaneous deployments as
much as possible
o Design applications for idempotency & eventual consistency as
much as possible
o Closely monitor replication & code sync delays
o Have push buttons ready to switch traffic for tenants
o Make High-Level Metrics monitoring systems also Multi-Region
91. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Conclusions
o It is an involved exercise. It requires careful planning
and design.
o However, various AWS services make implementation
much easier by doing undifferentiated heavy lifting
for our customers.
92. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Conclusions
o For companies with extremely high availability
requirements or/and geographically distributed user
base, benefits of Multi-Region Active-Active
architecture can be profound.
o In such cases, consider designing applications for
Multi-Region Active-Active implementation from DAY
ONE.
But, again as we say in Amazon, today is DAY ONE!
93. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
THANK YOU!