
Preparing for Multi-Cloud

The presentation describes the reasons for choosing a multi-cloud operation approach and gives an overview of the implementation challenges and how they can be addressed.

Published in: Software

  1. ©2019, Intechsystems SIA • Preparing for Multi-Cloud Operation • 20.06.2019 • By Konstantin Tjuterev & Oleg Andreyev
  2. Why 2 speakers? • Fewer slides to prepare for each of us • We’d like to show our case from 2 perspectives: business and high-level architecture, and the nitty-gritty details of implementation
  3. About Oleg Andreyev • Senior Software Architect @ Intexsys • 7+ years in Software Development
  4. About Konstantin Tjuterev • Founder and Chief Architect @ Intexsys • 20+ years in Software Development
  5. Agenda • Why do we need Multi-cloud? • Challenges – why it’s complicated • Addressing the challenges
  6. Initial state • Top 500 Online Retailer in USA • Existing Proprietary E-commerce Platform • Multi-component stack (PHP/Symfony, MySQL, Elasticsearch, Cassandra, RabbitMQ, HAProxy, Varnish, Node.js) • 15 online stores • 1 000 000+ items sold • $300 million annual turnover • Hosted on AWS since 2018
  7. Goals • Average of $820K daily sales • Downtime cost is at least $500/minute (820K/24/60 ≈ $569) • In reality, it can go as high as $5000/minute during Black Friday
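The per-minute figure on the Goals slide is plain arithmetic. A minimal sketch, assuming sales are spread evenly over the day (the function and constant names are illustrative, not from the talk):

```python
DAILY_SALES_USD = 820_000  # average daily sales quoted on the slide


def downtime_cost_per_minute(daily_sales: float) -> float:
    """Average revenue lost per minute of downtime, assuming evenly spread sales."""
    return daily_sales / 24 / 60


average_cost = downtime_cost_per_minute(DAILY_SALES_USD)
print(f"Average downtime cost: ${average_cost:.0f}/minute")  # prints $569/minute
```

The $5000/minute Black Friday figure is the speakers' own estimate for peak load; it does not follow from this average.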
  8. What IF? • What if AWS goes down? • Never happened? • But it DID • And multiple times
  9. What an AWS outage causes • “The four-hour AWS outage caused S&P 500 companies to lose $150 million, Cyence, a startup that models the economic impact of cyber risk, estimated, a Cyence spokeswoman said via email. US financial services companies lost $160 million, the firm estimated. That estimate doesn’t include countless other businesses that rely on S3, on other AWS services that rely on S3, or on service providers that built their services on Amazon’s cloud.” • Source: https://www.datacenterknowledge.com/archives/2017/03/02/aws-outage-that-broke-the-internet-caused-by-mistyped-command
  10. What happened?
  12. What IF? • What if we have a major problem in one of the (clustered) services? E.g. an Elasticsearch cluster issue or a MySQL master issue • What if we push the wrong button in some infrastructure/deployment automation tool?
  13. Disaster Recovery Options • Restoring from back-ups: snapshots of virtual machines/database dumps; the whole infrastructure has to be spun up • Cold stand-by: a set of prepared but stopped virtual machines; the database can be started, but a dump must be restored • Hot stand-by: a set of running virtual machines not serving traffic; running database replicas
  14. Criteria • Single/shared points of failure • Time to recovery / potential losses from outage ($5K/minute) • Time to switch back after the primary infrastructure is restored / potential losses • Cost of implementation / Complexity • Cost of maintenance • Additional benefits
  15. Comparison
      Option | Point of failure | Time to recovery/switching back | Complexity/Costs
      Backup | Single, if backups are in the same cloud (AWS) or would be restored to the same cloud | Very long – 24h at best (spinning up and reconfiguring the whole infrastructure) | Low / Very low (just storage)
      Cold stand-by | Depends (can be put in a different cloud/data center) | Medium – 12+h (database restore) if the cold infrastructure is up to date | High / Medium
      Hot stand-by | Depends (can be put in a different cloud/data center) | Low – less than 1h | Very High / High
  16. What is Multi-Cloud Operation? • Not Disaster Recovery – production traffic always runs from multiple independent clouds • No single point of failure • Almost instant recovery in case of a Cloud outage: all traffic is simply served by the surviving Cloud • No “failover/switching back” – when a Cloud is restored after an outage, we just start sending traffic there again • High complexity/cost, but much better reliability • Continuously live-tested (monitoring, deployment, real customers)
  17. Additional benefits • Blue/green deployments on the scale of the whole infrastructure • Running infrastructure-related experiments in an isolated but production environment • Ability to benefit from cost differences between cloud providers (given that we’re paying for disaster recovery anyway)
  18. Why not just AWS Multi-AZ? • Sometimes AWS fails in all Availability Zones • Vendor lock-in • Complexity of a Multi-AZ setup is similar to Multi-Cloud, just shifted: the single-cloud setup becomes easier (just use 1 AZ), the cross-cloud setup becomes more complicated • With the same overall complexity we can get better results • Better protection – no single point of failure
  19. Challenges • Pushing source data to Multiple Clouds • Data Synchronization between Clouds • Deployment • Dependencies • Scheduled jobs • Traffic balancing • Monitoring/Alerting
  20. Pushing data • RabbitMQ in the office • Clouds pull messages and update data in real time • Incoming traffic in Clouds is free • Read-only database replication from the office
  21. Data Synchronization between Clouds • This is the most challenging part • We need to replicate relational data (such as orders, users) between multiple clouds • We’re using MySQL and are not planning to change that • So, how to replicate data between clouds?
  22. MySQL Real Master-Master replication • Master MySQL nodes run in different clouds • Both write binary logs and execute each other’s • With Multi-Cloud we need to support writes from both clouds • Initially we were using auto-increment primary keys (as everyone does) • That won’t work with Master-Master
  23. What will happen if… • John Doe and Peter Doe both create an account/order • The requests are handled by different Clouds
  25. Replication conflict • Replication will stop • Replication can be fixed manually by ignoring the error • Multi-Cloud is out of sync
  26. How to avoid such a situation? • Set up MySQL Cluster • or set up Percona XtraDB Cluster • or set up MariaDB Galera Cluster
  27. But... • “All or nothing” approach • Your application needs to handle COMMIT • COMMIT slowness = slowest node in cluster • Network round-trip time / Certification time / Local apply • We are not building a cluster…
  29. Other solutions • Primary Key Auto Increment step for each server (even/odd) • Primary Key that will not collide
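The even/odd option can be illustrated with a small simulation. This is only a sketch: it models two masters as ID generators and does not touch a database. In MySQL the same effect is configured per server with the `auto_increment_increment` and `auto_increment_offset` system variables; the helper name `auto_increment_ids` is made up for this example.

```python
from itertools import count, islice


def auto_increment_ids(offset: int, increment: int):
    """Yield primary keys the way a server with this offset and step would hand them out."""
    return count(start=offset, step=increment)


# Default settings on both masters: each starts at 1 and steps by 1,
# so the same keys are issued independently in both clouds.
aws_default = list(islice(auto_increment_ids(1, 1), 5))
azure_default = list(islice(auto_increment_ids(1, 1), 5))
collisions = set(aws_default) & set(azure_default)
print(sorted(collisions))  # [1, 2, 3, 4, 5]

# Odd/even split: step 2 on both servers, different offsets.
aws_ids = list(islice(auto_increment_ids(1, 2), 5))    # 1, 3, 5, 7, 9
azure_ids = list(islice(auto_increment_ids(2, 2), 5))  # 2, 4, 6, 8, 10
print(set(aws_ids) & set(azure_ids))  # set()
```

The step has to be chosen up front for the maximum number of writers, which is one reason to prefer a key that cannot collide at all.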
  30. Universally unique identifier - UUID • It’s a 128-bit number • Written as 32 hexadecimal digits (128/4) • Also referred to as GUID
  31. Versions of UUID • Nil UUID – special case in which all 128 bits are zero • UUID v1 – generated from a time and a node id (MAC address) • UUID v2 – generated from an identifier, time, and a node id • UUID v3 – generated by hashing a namespace and a name (MD5) • UUID v4 – generated from a random or pseudo-random number • UUID v5 – same as v3 but using SHA-1 • UUID v6 – optimized version of UUID v1 (unofficial at the time of this talk)
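For a quick feel of these versions, here is a sketch using Python's standard `uuid` module (the talk's stack is PHP, so this is only an illustration; the name `example.com` is an arbitrary input). The module covers v1, v3, v4 and v5; v2 and v6 are left out.

```python
import uuid

nil = uuid.UUID(int=0)                              # Nil UUID: all 128 bits are zero
v1 = uuid.uuid1()                                   # time + node id
v3 = uuid.uuid3(uuid.NAMESPACE_DNS, "example.com")  # MD5 of namespace + name
v4 = uuid.uuid4()                                   # random
v5 = uuid.uuid5(uuid.NAMESPACE_DNS, "example.com")  # SHA-1 of namespace + name

for u in (nil, v1, v3, v4, v5):
    # Every UUID is 32 hex digits, i.e. 16 bytes, whatever its version.
    print(u.version, u, len(u.hex), len(u.bytes))
```

v3 and v5 are deterministic: the same namespace and name always produce the same UUID, while v1 and v4 differ on every call.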
  32. UUID v1 • It is time-based (sorting will not suffer much) • It can be stored in an optimized 16-byte form • Maximum average rate: 163 billion per second per node • Can be traced back to the server that created it • Optimized B-Tree • 16 bytes require less storage than 32 characters • See also: “To UUID or not to UUID?” and “Storing UUID Values in MySQL”
  33. UUID v1 Structure
  34. Optimized UUID v1
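The diagram for this slide did not survive in the transcript. The sketch below therefore assumes the commonly used optimization: move the high bits of the timestamp to the front so that the 16-byte value sorts roughly by creation time. MySQL 8.0's `UUID_TO_BIN(uuid, 1)` swaps the time-low and time-high groups in the same way. The function names are illustrative.

```python
import uuid


def uuid1_to_ordered_bytes(u: uuid.UUID) -> bytes:
    """Rearrange a v1 UUID so the most significant timestamp bits come first.

    Standard layout: time_low(4) | time_mid(2) | time_hi_and_version(2) | clock_seq + node(8)
    Ordered layout:  time_hi_and_version(2) | time_mid(2) | time_low(4) | clock_seq + node(8)
    """
    b = u.bytes
    return b[6:8] + b[4:6] + b[0:4] + b[8:]


def ordered_bytes_to_uuid(raw: bytes) -> uuid.UUID:
    """Inverse of uuid1_to_ordered_bytes."""
    return uuid.UUID(bytes=raw[4:8] + raw[2:4] + raw[0:2] + raw[8:])


original = uuid.uuid1()
stored = uuid1_to_ordered_bytes(original)  # 16 bytes, e.g. for a BINARY(16) column
assert ordered_bytes_to_uuid(stored) == original
print(len(stored), stored.hex())
```

In the standard layout the fastest-changing part of the timestamp comes first, so consecutive keys land all over the B-Tree; the reordered form keeps new rows close to the end of the index.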
  35. But conflicts are still possible… • Conflicts are possible, but not on the PK • Conflicts can be caused by other unique keys
  37. But it’s very unlikely to happen because • Normal replication delay < 1s • A customer cannot send requests with the same data to different clouds that fast
  38. What data to replicate between Clouds? • Each piece of information has its source; we need to understand it clearly • Data generated by the end user (customer or server) • Data pushed into the Cloud by us
  39. How to do Database Migrations • Follow “zero” downtime migration practices • Avoid table locking • Use ALGORITHM=INPLACE, LOCK=NONE when possible • Do not deploy code that writes into a new column before the migration that adds it • Always think about backward compatibility; usually there is no revert • Run DROP and RENAME only after you are fully satisfied • It’s better to run ALTER manually: it is more predictable • Always remember that you are running in Multi-Cloud/Hot stand-by
  40. Another safety-check for developers • Create separate users with DDL permissions for the two types of tables: tables populated by customers and tables populated by us • Remove DDL permissions from the main user • Group migrations by “category” • Before deploying to another Cloud make sure it has Seconds_Behind_Master (SBM) = 0
  41. How to deploy to Multi-Cloud • Make sure your application is Cloud agnostic • Store config in the environment (The Twelve-Factor App) • Do not deploy the application to all Clouds simultaneously • Backward compatibility
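“Store config in the environment” means the same build runs in every cloud and only the environment differs. A minimal sketch of that idea follows; the talk's application is PHP/Symfony, and the variable names `CLOUD_NAME`, `DATABASE_DSN` and `QUEUE_URL` are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Hypothetical variable names, chosen for this example only.
REQUIRED_VARS = ("CLOUD_NAME", "DATABASE_DSN", "QUEUE_URL")


@dataclass(frozen=True)
class CloudConfig:
    cloud_name: str
    database_dsn: str
    queue_url: str


def load_config(env: Mapping[str, str] = os.environ) -> CloudConfig:
    """Build the config from environment variables; fail fast if one is missing."""
    missing = [name for name in REQUIRED_VARS if name not in env]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return CloudConfig(
        cloud_name=env["CLOUD_NAME"],
        database_dsn=env["DATABASE_DSN"],
        queue_url=env["QUEUE_URL"],
    )


# Usage: pass a plain dict here; in production the default os.environ is used.
config = load_config({
    "CLOUD_NAME": "azure",
    "DATABASE_DSN": "mysql://app@db.internal/shop",
    "QUEUE_URL": "amqp://queue.internal",
})
print(config.cloud_name)  # azure
```

Nothing in the code names a specific provider, so deploying to a second cloud only requires a different set of variables.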
  42. How to deploy assets (JS/CSS) • Figure out the assets’ lifetime • Make sure you support a few old versions of assets (cache) • Make sure your assets are backward compatible • If you have persisted data in customer space (cookies, local storage), make sure it is compatible between versions • Monitor and log your assets • Make sure the assets hash is auto-generated
  43. Asset Lifetime
  44. Distributed CRON • Do not configure CRON directly on servers • Scheduling MUST be delegated to an independent system • Determine your clients • Handle VM “death” – you should be able to switch jobs fast • https://mesos.github.io/chronos/ • https://dkron.io
  45. Other facts • We had to upgrade MySQL twice within 6 months, 5.5 -> 5.7 -> 8.0 • 5.7 – GTID, Replication channels • 8.0 – Replication Filter per Channel • Use GTID (Global Transaction ID) for consistency • Use AUTO_POSITION for replication (only with GTID)
  46. Brief Summary • Use UUID to avoid conflicts with Primary Key • Determine what data needs to be synced • Monitor your replication with all possible tools • Use distributed CRON • Monitor and log your JS/CSS • Remember the CAP theorem
  47. Traffic balancing • DNS – weight-based with health checks • WAF/CDN + Rules Engine (on CDN Edges) • Location stickiness
  48. Routing with cloud stickiness (diagram) • Requests to www.site.com go through the WAF/CDN edges • Sticky cookie present? No: route to balanced.site.com (weight-based DNS: 90% aws.site.com, 10% azure.site.com); the response sets Set-Cookie: cloud=aws or Set-Cookie: cloud=azure • Sticky cookie present? Yes: cookie = AWS routes to aws.site.com, otherwise to azure.site.com • aws.site.com and azure.site.com use health-based DNS: each points at its own cloud while it is alive and at the other cloud during an outage
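The routing rule on this slide can be sketched as one decision function. In the talk the decision is made by DNS and CDN edge rules, not application code; the 90/10 weights and the `cloud` cookie come from the slide, while the function name and the `healthy` set are assumptions made for this example.

```python
import random
from typing import Optional, Set

CLOUD_WEIGHTS = {"aws": 90, "azure": 10}  # weight-based split from the slide


def pick_cloud(sticky_cookie: Optional[str], healthy: Set[str], rng=random) -> str:
    """Choose the cloud for a request; the caller then sets cookie cloud=<result>."""
    # Stickiness: keep the customer on the cloud named in the cookie while it is alive.
    if sticky_cookie in healthy:
        return sticky_cookie
    # No cookie, or the sticky cloud is down: weighted choice among healthy clouds.
    candidates = [cloud for cloud in CLOUD_WEIGHTS if cloud in healthy]
    if not candidates:
        raise RuntimeError("no healthy cloud available")
    weights = [CLOUD_WEIGHTS[cloud] for cloud in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


print(pick_cloud("azure", {"aws", "azure"}))  # azure (sticky)
print(pick_cloud("aws", {"azure"}))           # azure (AWS outage)
print(pick_cloud(None, {"aws"}))              # aws (only healthy cloud)
```

Stickiness keeps a customer's consecutive writes in one cloud, which fits the earlier point that replication delay makes cross-cloud conflicts on the same data unlikely.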
  49. Summary • Not everyone needs Multi-cloud • You need clear reasons to go Multi-cloud: disaster recovery, speed (geo-based) • It’s challenging and costly • But doable even with basic tools/stack (PHP/MySQL)
  50. Q&A
