AWS re:Invent 2016: Moving Mission Critical Apps from One Region to Multi-Region active/active (ARC309)

In gaming, low latency and reliable connectivity are the bare minimum users expect while playing online on PlayStation Network. Alex and Dustin share key architectural patterns for providing low-latency, multi-region services to global users. They discuss their testing methodologies and how to programmatically map out the dependencies of a large multi-region deployment with data-driven techniques. The patterns shared show how to adapt to changing bottlenecks and sudden spikes of several million requests. You’ll walk away with several key architectural patterns that can serve users at global scale while being mindful of costs.

  1. 1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Alexander Filipchik – Principal Engineer, Sony Interactive Entertainment Dustin Pham – Principal Engineer, Sony Interactive Entertainment David Green – Enterprise Solutions Architect, Amazon Web Services Moving Mission-Critical Apps from One Region to Multi-Region active/active November 30, 2016 ARC309
  2. 2. Thank you
  3. 3. What to expect from the session • Architecture Background • AWS global infrastructure • Single vs Multi-Region? • Multi-Region AWS Services • Case Study: Sony’s Multi-Region Active/Active Journey • Design approach • Lessons learned • Migrating without downtime
  4. 4. AWS Global Infrastructure
  5. 5. AWS worldwide locations Region (14) Coming Soon (4)
  6. 6. AWS worldwide locations
  7. 7. Region topology
  8. 8. Transit Transit AZ AZ AZ AZAZ Region topology
  9. 9. Transit Transit AZ AZ AZ AZAZ Availability Zone
  10. 10. Availability Zone Transit Transit AZ AZ AZ AZAZ
  11. 11. Single region high-availability approach • Leverage multiple Availability Zones (AZs) Availability Zone A Availability Zone B Availability Zone C us-east-1
  12. 12. Reminder: Region-wide AWS services • Amazon Simple Storage Service (Amazon S3) • Amazon Elastic File System (Amazon EFS) • Amazon Relational Database Service (RDS) • Amazon DynamoDB • And many more…
  13. 13. OK … should I use Multi-Region?
  14. 14. Good Reasons for Multi-Region • Lower latency to a subset of customers • Legal and regulatory compliance (e.g., data sovereignty) • Satisfy disaster recovery requirements
  15. 15. AWS Multi-Region services
  16. 16. Multi-Region services • Amazon Route 53 (Managed DNS) • S3 with cross-region replication • RDS multi-region database replication • EBS snapshots • AMI • And many more…
  17. 17. Amazon Route 53 • Health checks • Send traffic to healthy infrastructure • Latency-based routing • Geo DNS • Weighted Round Robin • Global footprint via 60+ POPs • Supports AWS and non-AWS resources
  18. 18. prod-1 prod-2 95% 5% example.net health health + weight Example: Weighted with failover prod.examp.net examp-fail.s3-website
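The weighted-with-failover pattern on this slide can be scripted. Below is a minimal sketch (not the presenters' code), assuming boto3; the hosted zone ID, record name, ELB DNS names, and health check IDs are hypothetical placeholders.

```python
# Hedged sketch: weighted Route 53 records with health-check failover.
# Zone ID, record names, targets, and health check IDs are made-up placeholders.
import boto3

route53 = boto3.client("route53")

def upsert_weighted_record(zone_id, name, set_id, weight, target_dns, health_check_id):
    """UPSERT one weighted CNAME; traffic splits by relative weight across sets,
    and Route 53 skips any set whose health check is failing."""
    route53.change_resource_record_sets(
        HostedZoneId=zone_id,
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": name,
                "Type": "CNAME",
                "SetIdentifier": set_id,
                "Weight": weight,
                "TTL": 60,
                "ResourceRecords": [{"Value": target_dns}],
                "HealthCheckId": health_check_id,
            },
        }]},
    )

# 95% of traffic to prod-1 and 5% to prod-2, mirroring the slide's example.
upsert_weighted_record("Z123EXAMPLE", "prod.example.net.", "prod-1", 95,
                       "prod-1-elb.us-east-1.elb.amazonaws.com", "hc-prod-1")
upsert_weighted_record("Z123EXAMPLE", "prod.example.net.", "prod-2", 5,
                       "prod-2-elb.us-west-2.elb.amazonaws.com", "hc-prod-2")
```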
  19. 19. S3 – cross-region replication Automated, fast, and reliable asynchronous replication of data across AWS regions • Only replicates new PUTs. Once S3 is configured, all new uploads into a source bucket will be replicated • Entire bucket or prefix based • 1:1 replication between any 2 regions / storage classes • Transition S3 ownership from primary account to sub-account Use cases: • Compliance—store data hundreds of miles apart • Lower latency—distribute data to regional customers • Security—create remote replicas managed by separate AWS accounts Source (Virginia) Destination (Oregon)
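To make the cross-region replication setup concrete, here is a minimal boto3 sketch (not from the talk). Bucket names and the IAM role ARN are hypothetical, and both buckets must have versioning enabled before replication can be configured.

```python
# Hedged sketch: enable S3 cross-region replication from a Virginia bucket
# to an Oregon bucket. All names and the role ARN are placeholders.
import boto3

s3 = boto3.client("s3")

# Versioning is a prerequisite on both source and destination buckets.
for bucket in ("my-source-virginia", "my-dest-oregon"):
    s3.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )

# Replicate every new PUT under the "store/" prefix to the Oregon bucket.
s3.put_bucket_replication(
    Bucket="my-source-virginia",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-crr-role",
        "Rules": [{
            "ID": "replicate-store-prefix",
            "Prefix": "store/",
            "Status": "Enabled",
            "Destination": {
                "Bucket": "arn:aws:s3:::my-dest-oregon",
                "StorageClass": "STANDARD",
            },
        }],
    },
)
```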
  20. 20. RDS cross-region replication • Move data closer to customers • Satisfy disaster recovery requirements • Relieve pressure on database master • Promote read-replica to master • AWS managed service
  21. 21. RDS cross-region replication (diagram)
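As a concrete illustration of the managed workflow, a minimal boto3 sketch (not from the talk): create a read replica in a second region, then promote it during failover. Instance identifiers and the source ARN are hypothetical.

```python
# Hedged sketch: cross-region RDS read replica plus later promotion.
# Identifiers and ARNs below are placeholders.
import boto3

# The replica is created by calling RDS in the *destination* region and
# pointing at the source instance's ARN in the primary region.
rds_oregon = boto3.client("rds", region_name="us-west-2")

rds_oregon.create_db_instance_read_replica(
    DBInstanceIdentifier="store-db-replica-oregon",
    SourceDBInstanceIdentifier="arn:aws:rds:us-east-1:123456789012:db:store-db",
    DBInstanceClass="db.r3.xlarge",
)

# Later, to satisfy a disaster recovery scenario, promote the replica
# to a standalone master:
# rds_oregon.promote_read_replica(DBInstanceIdentifier="store-db-replica-oregon")
```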
  22. 22. Leverage existing resources
  23. 23. Many resources exist AWS Reference Architecture Implementation Guides
  24. 24. What to expect from the session • Architecture Background • AWS global infrastructure • Single vs Multi-Region? • Enabling AWS services • Case Study: Sony Multi-Region Active/Active • Design approach • Lessons learned • Migrating without downtime
  25. 25. Who is talking? Alexander Filipchik (PSN: LaserToy) Principal Software Engineer at Sony Interactive Entertainment Dustin Pham Principal Software Engineer at Sony Interactive Entertainment
  26. 26. Our active/active story
  27. 27. Small team, large responsibility • Service team ran like a startup • Fewer than 10 core people working on new PS3 store services • PSN’s user base was already in the hundreds of millions of users • Relied on quick iterations of the architecture on AWS
  28. 28. Social
  29. 29. Video
  30. 30. Commerce
  31. 31. The Year of VR: multiple new virtual reality platform launches of varying experience levels (e.g., Cardboard)
  32. 32. Transforming the store
  33. 33. Delivered new store • Great job, now onto the PS4 • PS4 launch – 1 million users at once on Day 1, Hour 1 • Designing for many different use cases at scale
  34. 34. Architecture phases: Proof of Concept → Scale → Optimize → Make Highly Available
  35. 35. Next step: make highly available • Highly available for us: multi-region active/active • Raising key questions: • How does one move a large set of critical apps with hundreds of terabytes of live data? • How do we architect every aspect to allow for multi-region active/active? • How do we turn on active/active without user impact? • User impact includes hardware (PS3/PS4/etc.) and game partners! • Where do we even begin?
  36. 36. Starting with applications
  37. 37. Applications • First question to answer: What does it mean to be multiregional? • Different people had different answers: • Active/stand-by vs. active/active • Full data replication vs. partial • Automatic failover vs. manual • Etc.
  38. 38. After some healthy discussions
  39. 39. Agreement • “You should be able to lose 1 of anything” approach • Which means we should be able to survive, without any visible impact, the loss of: • 1 server • 1 Availability Zone • 1 region
  40. 40. Starting with uncertainty • Multiple macro and micro services • Stateless and stateful services • They depend on multiple technologies • Some are multiregional and some are not • Documentation was, as always, out of date
  41. 41. Inventory of dependencies (chart: % of applications using each technology)
  42. 42. What is multiregional by design? With some customizations
  43. 43. Stages of grief • Denial – can’t be true, let’s check again • Anger – we told everyone to be active/active ready!!! • Bargaining – active/stand-by? • Depression – we can’t do it • Acceptance – let’s work to fix it, we have 6 months…
  44. 44. What it tells us • We can’t just put things in two regions and expect them to work • We will need to do some work to: • Migrate services to technology which is multiregional by design • Somehow make underlying technology multiregional
  45. 45. Scheduling/optimization problem • There is work that should be done on both the application and infrastructure sides • We need to schedule it so we can get results faster and minimize waits • And we wanted a machine to help us
  46. 46. Neo4j: the world’s leading graph database, one that can store a graph of 30B nodes, here to help us deal with our problem
  47. 47. Why Neo4J • Graph engine and we are dealing with a graph • Query language that is very powerful • Can be populated programmatically • Can show us something we didn’t expect
  48. 48. How to use it? • Model • Identify nodes and relations • Tracing • Code analyzer • Talking to people • Generate the graph • Run queries
  49. 49. Model example • Nodes • Users • Technology: (Cassandra, Redis) • multiregional: true/false • Service (applications) • stateless: true/false • Edges • Usage patterns (read, write)
  50. 50. Graph definition example
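A minimal sketch of what "generate the graph" could look like programmatically with the Neo4j Python driver. This is not Sony's actual tooling; the connection details, labels, and inventory rows are hypothetical and follow the model on the previous slide.

```python
# Hedged sketch: load the service/technology dependency graph into Neo4j.
# URI, credentials, and the inventory rows are placeholders.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

# (service, stateless, technology, multiregional-by-design, usage pattern)
inventory = [
    ("store-api", True,  "Cassandra", True,  "read_write"),
    ("store-api", True,  "Redis",     False, "read"),
    ("indexer",   False, "Solr",      False, "write"),
]

with driver.session() as session:
    for svc, stateless, tech, multiregional, pattern in inventory:
        # MERGE keeps nodes unique; edges carry the usage pattern.
        session.run(
            "MERGE (s:Service {name: $svc}) SET s.stateless = $stateless "
            "MERGE (t:Technology {name: $tech}) SET t.multiregional = $multiregional "
            "MERGE (s)-[u:USES]->(t) SET u.pattern = $pattern",
            svc=svc, stateless=stateless, tech=tech,
            multiregional=multiregional, pattern=pattern,
        )
driver.close()
```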
  51. 51. Graph example Can be enriched with: • Load balancers • Security groups • VPCs • NATs • Etc.
  52. 52. Ours looked more like
  53. 53. And running some Neo4j magic (this query is the important one)
  54. 54. Shows you what is ready to go
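A hedged example of the kind of query that "shows you what is ready to go": services whose every dependency is a technology that is multiregional by design. The Cypher and connection details are illustrative, not the presenters' actual query.

```python
# Hedged sketch: list services that depend only on multiregional technologies.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

READY_TO_GO = """
MATCH (s:Service)
WHERE NOT (s)-[:USES]->(:Technology {multiregional: false})
RETURN s.name AS service
ORDER BY service
"""

with driver.session() as session:
    ready = [record["service"] for record in session.run(READY_TO_GO)]

print("Ready for multi-region deployment:", ready)
driver.close()
```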
  55. 55. What to do next • Validate multiregional technologies do actually work • Figure out what to do with non-multiregional technologies • Move services in the following order:
  56. 56. Validating our main DB (Cassandra) A lot of unknowns: • Will it work? • Will performance degrade? • How eventual is multiregional eventual consistency? • Will we hit any roadblocks? • Well, how many roadblocks will we hit?
  57. 57. What did we know? Netflix is doing it on AWS, and they actually tested it: they wrote 1M records in one region of a multi-region cluster; 500 ms later a read was initiated in the other regions; all records were successfully read.
  58. 58. Well… Some questions to answer: Should we just trust Netflix’s results, replicate the data, and see what happens? Is their experiment applicable to our situation? Can we do better? (Meme: how to get an engineer’s attention: break something, free coffee, or say, "there's gotta be a better way to do this")
  59. 59. Cassandra validation strategy • Use production load/data • Simulate disruptions • Track replication latencies • Track lost mutations • Cassandra modifications were required
  60. 60. Preparation (diagram: an Exporter in Region 1 feeds Ingesters in Region 1 and Region 2)
  61. 61. Test (diagram: a Read/Write Loader runs against each of Region 1 and Region 2)
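A minimal sketch of the read/write loader idea: write a marker in one region and poll the other region until it appears, recording the lag and flagging mutations that never show up. This is illustrative only; contact points, keyspace, and table are hypothetical, and the real validation used production load and modified Cassandra to track lost mutations.

```python
# Hedged sketch: measure cross-region Cassandra replication lag.
# IPs, keyspace "lag_test", and table "markers" are placeholders.
import time
import uuid
from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel

writer = Cluster(["10.0.1.16"]).connect("lag_test")   # region 1 coordinator
reader = Cluster(["10.0.2.16"]).connect("lag_test")   # region 2 coordinator

insert = writer.prepare("INSERT INTO markers (id, written_at) VALUES (?, ?)")
insert.consistency_level = ConsistencyLevel.LOCAL_QUORUM
select = reader.prepare("SELECT id FROM markers WHERE id = ?")
select.consistency_level = ConsistencyLevel.LOCAL_ONE

def measure_lag(timeout=30.0):
    """Write a marker in region 1 and poll region 2 until it becomes visible."""
    marker, start = uuid.uuid4(), time.time()
    writer.execute(insert, (marker, int(start * 1000)))
    while time.time() - start < timeout:
        if reader.execute(select, (marker,)).one():
            return time.time() - start           # seconds until visible remotely
        time.sleep(0.01)
    return None                                   # lost or badly delayed mutation

print("replication lag:", measure_lag())
```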
  62. 62. Analysis
  63. 63. Sample results (usw1-usw2): chart of a two-DC connection cut-off and recovery, showing replication lag (Pct95, Pct99, Pct999, MaxLag) on a logarithmic scale
  64. 64. Things that are not multiregional by design We gave teams 2 options: • Redesign if it is critical to the user experience • If not in the critical path (batch jobs): • active/passive • master/slave • Use Kafka as a replication backbone (recommended; a sketch of this pattern follows the Solr examples below)
  65. 65. Solr example (pre active/active): an Indexer writes to a Master; Replicators copy the index to Read Replicas, which serve App1 and App2
  66. 66. Solr example (easy active/active): a single Indexer and Master; Replicators and Read Replicas serve the Apps in Region 1 and in Region 2
  67. 67. Solr example (Kafka active/active): each region runs its own Solr Indexer and Read Replicas serving its local Apps, with the indexers fed from a shared stream
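A minimal sketch of the recommended Kafka-backbone pattern from these slides: updates are published once to a shared topic, and an indexer in each region consumes them into its local Solr. Broker addresses, topic name, consumer group, and the Solr URL are hypothetical.

```python
# Hedged sketch: a per-region indexer consuming a shared Kafka topic
# and writing documents into the local Solr core. All endpoints are placeholders.
import json
import requests
from kafka import KafkaConsumer

LOCAL_SOLR = "http://solr.region-local:8983/solr/store/update?commit=true"

consumer = KafkaConsumer(
    "catalog-updates",
    bootstrap_servers=["kafka-1:9092", "kafka-2:9092"],
    group_id="solr-indexer-us-west-2",            # one consumer group per region
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    enable_auto_commit=True,
)

for message in consumer:
    doc = message.value                           # e.g. {"id": "...", "title": "..."}
    # Index into this region's Solr only; the other region's indexer does the same.
    requests.post(LOCAL_SOLR, json=[doc], timeout=5).raise_for_status()
```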
  68. 68. Are we missing anything? Yes, infrastructure
  69. 69. Decompose and recompose
  70. 70. Breaking up the system into moveable parts App + caching tier Data tier Inbound tier Outbound tier Clients
  71. 71. Phase 1: Infrastructure Private Subnet Public Subnet ELBs Inbound tier Outbound Tier Infrastructure to build/move: • VPCs • Subnets • ACLs • ELBs • IGW • NAT • Egress
  72. 72. Phase 1: Infrastructure key points • Building infrastructure in the new region must be fully automated (Infrastructure as Code) • Regional communication decisions • VPNs? • Over the Internet? • Do the infrastructures have to match exactly? • The 1st region evolved organically • The 2nd region should be the blueprint for all new region DCs
  73. 73. Phase 2: Data Public subnet ELBs Data tier Inbound tier Outbound tier
  74. 74. Phase 2: Data option 1 replication over VPN Public Subnet ELBs Data tier Inbound tier Outbound tier Region 2 VPN
  75. 75. Phase 2: Data option 1 replication over VPN • Pros • Setting up a VPN with the current network architecture would be easier on the data tier • Secure • Managing data node intercommunication is straightforward and has lower operational overhead • Cons • Limit on throughput • Data set is large and can quickly saturate the VPN • Scaling more applications in the future will be complicated!
  76. 76. Phase 2: Data option 2 replication over ENIs with public IPs Private subnet Public subnet ELBs Data tier Inbound tier Outbound tier Region 2 SSL SSL
  77. 77. Phase 2: Data option 2 replication over ENIs with public IPs • Pros • Not network constrained • Able to add more applications + data without needing to build new infrastructure to support them • Cons • Operationally, more orchestration (Cassandra, for example, needs to know the other nodes’ Elastic IPs) • Internode data transfer security is a must
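For context on the data tier, a hedged Cassandra sketch: what actually drives cross-region replication is the keyspace's NetworkTopologyStrategy placing replicas in both regional data centers, while applications pin themselves to their local DC. Contact points, DC names, and replication factors are hypothetical; internode SSL is configured on the nodes themselves (cassandra.yaml) and is not shown here.

```python
# Hedged sketch: a multi-DC keyspace plus a DC-local connection policy.
# IPs, DC names, and replication factors are placeholders.
from cassandra.cluster import Cluster
from cassandra.policies import DCAwareRoundRobinPolicy

# Applications connect to their local DC only, matching the "stay in region" rule.
cluster = Cluster(
    ["10.0.1.16", "10.0.1.17"],
    load_balancing_policy=DCAwareRoundRobinPolicy(local_dc="us-east-1"),
)
session = cluster.connect()

# Replicas are placed in both regional data centers by the keyspace definition.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS store
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'us-east-1': 3,
        'us-west-2': 3
    }
""")
```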
  78. 78. Phase 3: App tier + cache strategy Outbound Tier Region 2
  79. 79. Phase 3: App tier + cache strategy • Applications communicate within a region only • Applications do not call another region’s databases, caches, or applications • Isolation creates predictable failure cases and clearly defines failure domains • Monitoring and alerting are greatly simplified in this model
  80. 80. Phase 4: Client routing Region 1 Region 2 DNS
  81. 81. Phase 4: Client routing • Predictable “sticky” routing via geo-routing to avoid user bounce • Data replication manages cross-region state • Allows for routing to stateless services • Ability to do %-based routing to manage different failure scenarios
  82. 82. Putting it all together
  83. 83. Software design for multiregion deployments • Typical software architecture APIs Business Logic Data Access Cross Cutting Config
  84. 84. Software design for multiregion deployments Region 1 Region 2 Remember when we mentioned that application-tier call patterns should be isolated within a region? How do we achieve this simply?
  85. 85. Software configuration approaches • An application config to connect to a database could look like: cassandra.seeds=10.0.1.16,10.0.1.17 • A naïve approach would be to maintain multiple configs per deployable depending on its region: cassandra.seeds.region1=10.0.1.16,10.0.1.17 cassandra.seeds.region2=10.0.2.16,10.0.2.17 • This, of course, results in an app config management nightmare, especially now with 2 regions
  86. 86. Software configuration approaches • What if we implemented a basic “central” way of configuration? (Diagram: in each region, the app asks a local DB “Where are my C* seeds?” and gets back “IPs are x.x.x.x”; the config simply says cassandra.seeds=cass-seed1,cass-seed2, and cass-seed1 resolves to a different x.x.x.x in each region)
  87. 87. Simplified software configuration (context) • Context is made available to the application and contains: • Data Center/region • Endpoint short-name resolution • Environment (Dev, QA, Prod, A/B) • Database connection details • Context is the responsibility of the infrastructure itself and is provided through build automation, AWS tagging, etc. • App is responsible for behaving correctly based on the context
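A minimal sketch of the "context" idea: the deployable is identical in every region and resolves short names through a context supplied by the infrastructure. The file path, field names, and endpoint map are hypothetical.

```python
# Hedged sketch: a context object the app reads instead of per-region configs.
# The path and JSON layout are placeholders supplied by build automation/tagging.
import json

class Context:
    def __init__(self, path="/etc/app/context.json"):
        with open(path) as fh:
            data = json.load(fh)
        self.region = data["region"]              # e.g. "us-east-1"
        self.environment = data["environment"]    # e.g. "prod"
        self._endpoints = data["endpoints"]       # short-name -> regional hosts

    def resolve(self, short_name):
        """Resolve a short name like 'cassandra.seeds' to this region's hosts."""
        return self._endpoints[short_name]

ctx = Context()
# The same deployable works in every region; only the context differs.
seeds = ctx.resolve("cassandra.seeds")            # e.g. ["cass-seed1", "cass-seed2"]
print(ctx.region, ctx.environment, seeds)
```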
  88. 88. Infrastructure as code • New regions must be built through automation • Service specifications are translated into Terraform • An internal tool and DSL were built to manage domain-specific needs • Example: • Specify that an app requires Cassandra and SNS • Generates Terraform to create security groups for ports 9160 and 7199-7999, build SNS, build an ELB for the app, etc.
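A toy sketch of the internal-DSL idea (the real tool is not public): a small spec saying "this app needs Cassandra" expands into Terraform security-group rules. The spec format, port map, and CIDR block below are made up.

```python
# Hedged sketch: generate a Terraform security-group snippet from an app spec.
# Port mappings and CIDR are illustrative placeholders, not the real tool's rules.
PORTS = {"cassandra": [(9160, 9160), (7199, 7999)], "solr": [(8983, 8983)]}

def security_group_hcl(app, needs):
    """Expand a dependency list into ingress rules for one security group."""
    rules = []
    for dependency in needs:
        for from_port, to_port in PORTS[dependency]:
            rules.append(
                '  ingress {\n'
                f'    from_port   = {from_port}\n'
                f'    to_port     = {to_port}\n'
                '    protocol    = "tcp"\n'
                '    cidr_blocks = ["10.0.0.0/8"]\n'
                '  }\n'
            )
    return (f'resource "aws_security_group" "{app}" {{\n'
            + "".join(rules)
            + "}\n")

print(security_group_hcl("store-api", ["cassandra"]))
```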
  89. 89. Database automation • Ansible runs assist in building Cassandra in the public subnet and associating EIPs with every new node • Manages network rules (whitelisting) • Manages certificates and SSL (diagram: private subnet, public subnet, ELBs, outbound tier, Region 2, SSL)
  90. 90. Monitoring multiregional deployments
  91. 91. Monitoring through proper tagging • Part of the “Context” applications are aware of is the region • The region is added to all app logs • Region tags are then added to metrics and can be surfaced in Grafana or any monitoring tool of your choice • Cross-region monitoring of key metrics and alerting • Data replication (hints in Cassandra, seconds behind master in MySQL, etc.) • Data in/out
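A minimal sketch of region tagging on the logging side, using the standard Python logging module. The region value would come from the application context described earlier; the logger name and format are hypothetical.

```python
# Hedged sketch: stamp every log record with the region so cross-region
# dashboards can group by it. Metrics would get the same tag.
import logging

REGION = "us-east-1"   # in practice, read from the application Context

class RegionFilter(logging.Filter):
    def filter(self, record):
        record.region = REGION                    # every record carries its region
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s region=%(region)s %(message)s"))

logger = logging.getLogger("store-api")
logger.addFilter(RegionFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("catalog request served in 42 ms")
# Metrics get the same treatment, e.g. request_latency{region="us-east-1"}.
```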
  92. 92. Putting it all together Region 1 Region 2 Create infrastructure Replicate DNS
  93. 93. Lessons learned
  94. 94. Lessons learned • Data synchronization is super critical, so build your dependency map off of the data technologies first. • Always run your own benchmarking. • Do not allow legacy to control the other region’s design. Find a healthy transition and balance between old and new. • Applications must be context-driven. • Depending on your data load, cross-region VPNs may not make sense.
  95. 95. PlayStation is hiring in SF: Find us at hackitects.com
  96. 96. Thank you!
  97. 97. Remember to complete your evaluations!
  98. 98. Related Sessions
