M|18 Creating a Reference Architecture for High Availability at Nokia

Published in: Data & Analytics

  1. Creating a Reference Architecture for High Availability at Nokia
     Rick Lane, Consulting Member of Technical Staff, Nokia
  2. Geo-redundant HA reference architecture
  3. Nokia Common Software Foundation (CSF)
     ● Central development organization for commonly used components
     ● Works with PUs within the company to determine needs and plans
     ● Determines which open-source products are used or planned by multiple PUs
     ● Productizes heavily used open-source products in the central organization, replacing the silo-based development model, to:
       ○ Reduce overall corporate development costs
       ○ Improve time to market for products
  4. CSF OS/DB Department
     ● Centralized OS (Linux) and DB development and distribution
     ● Each DB component defines common reference architecture(s)
     ● For each reference architecture, the CSF DB component develops the following around the open-source DB:
       ○ Deployment and Life Cycle Management (LCM) of the database in a cloud/OpenStack environment
       ○ Docker container for deployment and LCM via Kubernetes/Helm → PaaS
       ○ Tools/scripts to support deployment and management of the DB as above, including Ansible
     ● Centralized support for OS and DB within the corporation
  5. CSF CMDB (MariaDB Component)
     ● MariaDB was selected as one of the CSF DB components
     ● CMDB provides a DB server with the following reference architectures:
       ○ Standalone
       ○ Galera Cluster with or without Arbitrator
       ○ Replication: Master/Master and Master/Slave
     ● Galera or Master-Master with JDBC is mostly used for basic HA needs
     ● Deployment and LCM developed and packaged for:
       ○ Cloud/OpenStack: Ansible automated deployment for all reference architectures
       ○ Automated Kubernetes container deployment via Helm charts
  6. Geo-Redundant HA Requirements
     ● Multi-datacenter deployment (dual datacenter initially)
     ● Must have HA clustering at each datacenter
     ● 99.999% availability
     ● 5-second WAN replication lag tolerance
     ● 6 hours of transaction buffering for datacenter disconnect
     ● Automatic replication recovery after datacenter disconnect
     ● Procedure to recover the standby datacenter from the active datacenter after 6 hours of replication failure
     [Diagram: DC-A and DC-B, each with an APPLICATION and an HA clustered DB]
  7. Geo-Redundant Application Assumptions
     ● The application writes to only one datacenter at a time
     ● The application is responsible for failing over to the alternate datacenter on failure
     ● Inter-DC replication lag and binlog buffering are a function of the application transaction rate and of WAN speed and reliability
     ● Some data loss is acceptable on failure
  8. Architecture Alternatives
     ● Galera Cluster at each DC, with segment_id defined to limit inter-DC traffic to a single link
       ○ Achieves HA at each DC with minimal data loss (internal)
       ○ Would ideally require 3 datacenters with 3 nodes each to meet quorum requirements
       ○ Synchronous replication between DCs could severely impact performance for more write-intensive applications
     ● Master-Master at each DC
       ○ Can provide intra-DC HA via JDBC
       ○ auto_increment_offset and auto_increment_increment would have to be managed for all nodes at all DCs
       ○ Inter-DC replication path adjustments are still needed on local DC failover
     ● Either way, a node is still needed to run a service that monitors alternate DC health and raises alarms
       ○ MaxScale makes sense here to provide the proxy and manage the local cluster
       ○ Nokia-added services will monitor alternate DC health via MaxScale
  9. Geo-Redundant Reference Architecture
     ● Master/Slave at each datacenter for HA and auto-failover
     ● Masters in each datacenter cross-replicate to each other
     [Diagram: DC-A and DC-B, each with an APPLICATION, a MASTER, and SLAVES; arrows labeled read and write]
  10. Datacenter HA (automatic failover)
      ● First started working with replication-manager to support auto-failover
        ○ replication-manager runs alongside MaxScale and performs auto-failover
        ○ MaxScale was expected to auto-detect topology changes and update status, but did not always do so
        ○ replication-manager configuration is very complex
      ● Discovered MaxScale 2.2 with auto-failover built in
        ○ Single-point topology manager (no inconsistencies)
        ○ No additional 3rd-party open-source plugins required (fully supported)
        ○ Worked with the MaxScale development team as a beta tester
  11. MaxScale-2.2 Initial Testing
      ● Very simple configuration:
          ignore_external_masters = true
          auto_failover = true
          auto_rejoin = true
      ● MaxScale provides full datacenter HA
        ○ Automatic promotion of a Slave to new Master when the Master fails
        ○ Rejoin of the previously failed Master as a Slave when it recovers
        ○ MaxScale manages the intra-DC cluster; we would add scripts to fix replication in both directions when a Master failover occurs in either DC
        ○ Huge simplification over using 3rd-party replication managers; deterministic behavior
        ○ MaxScale now has crash-safe recovery
  12. MaxScale-2.2 Final Testing
      ● New configuration:
          ignore_external_masters = false
          auto_failover = true
          auto_rejoin = true
      ● Additional behavior
        ○ Changed ignore_external_masters from true to false
        ○ The new Master automatically replicates from the same external Master as the failed Master
        ○ A notification script plugin on topology changes lets us automatically repoint the alternate datacenter's Master to replicate from the new Master on the “new_master” notify event
        ○ Failover is very quick with a 500 ms monitor interval: 2 to 3 seconds?
  13. Master/Slave Database HA with MaxScale
      • Automatic failover: election and promotion of a slave by MaxScale
      • Continues routing read queries to slave during failover
      • Manual failover and switchover interface
      Failover sequence (numbered as in the diagram of MASTER, SLAVES, and MAXSCALE):
        1. Master fails
        2. MaxScale elects a new master
        3. MaxScale promotes the candidate slave to new master
        4. MaxScale instructs the remaining slaves to replicate from the new master
        5. MaxScale sets the new master to replicate from the external server the old master replicated from (if one existed)
        6. MaxScale raises the new_master notify event; a Nokia script repoints the external master to replicate from the new local master
  14. Containerized Delivery
      ● Deployment of Master/Slave container cluster
        ○ Automatic configuration of the first container as Master and the other two as Slaves
        ○ All containers automatically recover in the Slave role
        ○ IPs are advertised via etcd service advertisement
      ● Deployment of MaxScale container
        ○ Gets all config from the Kubernetes manifest file and server IPs from the etcd advertisement
        ○ Automatically configures MaxScale and starts monitoring the DB Master/Slave cluster
      ● Container fails and is redeployed with a different IP
        ○ Developed a service to monitor etcd IP advertisements and detect host IP changes at MaxScale
        ○ Run: maxadmin alter server <hostname> address=<ip>
  15. Nokia developed capabilities
      ● Notify script plugin on the “new_master” event to point the remote DC master at the newly promoted master
      ● An additional service will be developed to perform the following functions:
        ○ Monitor MaxScale in the alternate datacenter to verify datacenter health and verify that the local master is replicating from the correct external master
        ○ Monitor replication and generate SNMP traps when replication breaks
        ○ Monitor replication lag and generate SNMP traps when lag exceeds a threshold
        ○ Possibly retry CHANGE MASTER if replication fails because a GTID is not found
        ○ Since log_slave_updates must be set, a configurable “slave flush interval” to flush binlogs on the slaves
  16. Future work
      ● Still need to see how replication holds up during failure under load
      ● MaxScale enhancement to fully automate inter-DC master failover (both directions)
      ● Support Galera Cluster as the HA solution in each DC
      ● Support a MaxScale HA cluster at each DC (via keepalived)
      ● Support segregated database access on both databases at the same time?
  17. Thank you! Q&A
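
As a rough check on the slide 6 requirements, 99.999% availability leaves only a few minutes of downtime per year, so every failover has to fit inside that budget. The sketch below works out the budget; the function name and the 365.25-day year are illustrative assumptions, not something stated in the talk.

```python
def downtime_budget_seconds(availability_pct: float, days: float = 365.25) -> float:
    """Seconds of allowed downtime over `days` at the given availability percentage."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability must be a percentage between 0 and 100")
    total_seconds = days * 24 * 60 * 60
    return total_seconds * (1.0 - availability_pct / 100.0)


if __name__ == "__main__":
    budget = downtime_budget_seconds(99.999)
    print(f"99.999% allows about {budget / 60:.2f} minutes of downtime per year")
```

Dividing the budget by an assumed per-failover outage (for example the 2 to 3 seconds questioned on slide 12) gives a feel for how many failovers a year the target can absorb.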
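
Slide 7 notes that binlog buffering depends on the application transaction rate and the WAN. A minimal sizing sketch for the 6-hour buffering requirement follows; the function names, the average binlog bytes per transaction, and the WAN throughput figures are all hypothetical inputs, not measurements from the talk.

```python
def binlog_backlog_bytes(tx_per_second: float,
                         avg_binlog_bytes_per_tx: float,
                         outage_hours: float = 6.0) -> float:
    """Binlog volume that accumulates on the active DC during a WAN outage."""
    return tx_per_second * avg_binlog_bytes_per_tx * outage_hours * 3600.0


def catch_up_seconds(backlog_bytes: float,
                     wan_bytes_per_second: float,
                     write_bytes_per_second: float) -> float:
    """Time to drain the backlog while new writes keep arriving.

    Returns infinity when the WAN cannot outpace the ongoing write rate.
    """
    spare = wan_bytes_per_second - write_bytes_per_second
    if spare <= 0:
        return float("inf")
    return backlog_bytes / spare


if __name__ == "__main__":
    # Hypothetical workload: 500 tx/s at 1 KiB of binlog each, 2 MB/s WAN.
    backlog = binlog_backlog_bytes(500, 1024)
    print(f"backlog after 6 h: {backlog / 1e9:.2f} GB")
    print(f"catch-up time: {catch_up_seconds(backlog, 2_000_000, 500 * 1024) / 3600:.2f} h")
```

The second function is a reminder that the standby only catches up if the WAN has headroom above the steady-state write rate.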
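
The quorum point on slide 8 can be checked with simple arithmetic: after an abrupt partition, a Galera component stays primary only if it holds more than half of the cluster's votes. This is a minimal sketch assuming equal node weights and no arbitrator; the helper name is hypothetical.

```python
from typing import Sequence


def survives_dc_loss(nodes_per_dc: Sequence[int], lost_dc: int) -> bool:
    """True if the surviving nodes still hold a strict majority after one DC is lost."""
    total = sum(nodes_per_dc)
    remaining = total - nodes_per_dc[lost_dc]
    return remaining * 2 > total


if __name__ == "__main__":
    # Two DCs with 3 nodes each: losing either leaves 3 of 6 votes, not a majority.
    print("2 DCs x 3 nodes survives DC loss:", survives_dc_loss([3, 3], 0))
    # Three DCs with 3 nodes each: losing one leaves 6 of 9 votes.
    print("3 DCs x 3 nodes survives DC loss:", survives_dc_loss([3, 3, 3], 0))
```

With only two datacenters, no even split can keep quorum on its own, which is why the slide calls for a third site.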
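
Step 6 on slide 13 (and the notify script on slides 12 and 15) could look roughly like the sketch below, which only builds the SQL to run on the alternate datacenter's master. It is not Nokia's script: it assumes the event name and the initiating server are passed in by the monitor, that the initiator arrives as `[host]:port`, and that GTID replication with `MASTER_USE_GTID=slave_pos` is wanted. The function names are hypothetical.

```python
import re
from typing import List, Tuple

# Assumed initiator format: "[host]:port"
_INITIATOR_RE = re.compile(r"^\[(?P<host>[^\]]+)\]:(?P<port>\d+)$")


def parse_initiator(initiator: str) -> Tuple[str, int]:
    """Split an initiator string such as "[10.0.0.12]:3306" into host and port."""
    match = _INITIATOR_RE.match(initiator.strip())
    if match is None:
        raise ValueError(f"unexpected initiator format: {initiator!r}")
    return match.group("host"), int(match.group("port"))


def repoint_statements(event: str, initiator: str) -> List[str]:
    """SQL to run on the remote DC master so it replicates from the new local master."""
    if event != "new_master":
        return []  # other topology events need no inter-DC change in this sketch
    host, port = parse_initiator(initiator)
    # Refuse anything that is not a plain hostname or IP before embedding it in SQL.
    if not re.fullmatch(r"[A-Za-z0-9_.:\-]+", host):
        raise ValueError(f"refusing suspicious host value: {host!r}")
    return [
        "STOP SLAVE",
        f"CHANGE MASTER TO MASTER_HOST='{host}', MASTER_PORT={port}, "
        "MASTER_USE_GTID=slave_pos",
        "START SLAVE",
    ]


if __name__ == "__main__":
    for statement in repoint_statements("new_master", "[10.0.0.12]:3306"):
        print(statement + ";")
```

Replication credentials are left out on purpose: CHANGE MASTER keeps the previously configured values for options that are not specified.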
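
The IP-change handling on slide 14 can be sketched as a comparison between the addresses MaxScale was last configured with and the addresses currently advertised. The code below only builds the `maxadmin alter server <hostname> address=<ip>` command lines quoted on the slide; how the advertised addresses are read from etcd is left out, and the function name and dictionary layout are assumptions.

```python
from typing import Dict, List


def address_updates(known: Dict[str, str], advertised: Dict[str, str]) -> List[List[str]]:
    """Return one maxadmin argv per server whose advertised IP differs from the known one."""
    commands = []
    for name in sorted(advertised):
        new_ip = advertised[name]
        if name in known and known[name] != new_ip:
            commands.append(["maxadmin", "alter", "server", name, f"address={new_ip}"])
    return commands


if __name__ == "__main__":
    known = {"db1": "10.0.0.1", "db2": "10.0.0.2", "db3": "10.0.0.3"}
    advertised = {"db1": "10.0.0.1", "db2": "10.0.0.9", "db3": "10.0.0.3"}
    for argv in address_updates(known, advertised):
        print(" ".join(argv))
```

Returning argument lists rather than shell strings keeps the sketch safe to hand to a process runner without quoting concerns.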
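
The replication monitoring planned on slide 15 amounts to classifying the output of SHOW SLAVE STATUS. A minimal sketch, assuming the status row has already been fetched into a mapping and using the 5-second lag tolerance from slide 6 as the default threshold; the alarm names are hypothetical and sending the SNMP trap itself is out of scope.

```python
from typing import Mapping, Optional


def replication_alarm(status: Mapping[str, Optional[str]],
                      max_lag_seconds: float = 5.0) -> Optional[str]:
    """Return an alarm name for a SHOW SLAVE STATUS row, or None when healthy."""
    io_ok = status.get("Slave_IO_Running") == "Yes"
    sql_ok = status.get("Slave_SQL_Running") == "Yes"
    if not (io_ok and sql_ok):
        return "replication_broken"
    lag = status.get("Seconds_Behind_Master")
    if lag is None:
        # The server reports NULL when it cannot compute the lag.
        return "replication_broken"
    if float(lag) > max_lag_seconds:
        return "replication_lag"
    return None


if __name__ == "__main__":
    row = {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "Yes", "Seconds_Behind_Master": "12"}
    print(replication_alarm(row))
```

A caller would map each returned alarm name to the corresponding SNMP trap and clear it once the function returns None again.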
