
M|18 Choosing the Right High Availability Strategy for You


Published in: Data & Analytics

  1. MariaDB High Availability, by Gerardo “Gerry” Narvaja and Ulrich Moser
  2. High Availability Defined: “In information technology, high availability refers to a system or component that is continuously operational for a desirably long length of time.” (Availability, Wikipedia). Availability = up time / total time.
  3. Approaches to HA:
     1. Backup / restore: < 99.9% availability (~3.7 days of downtime / year)
     2. Simple replication / manual failover: ~99.9% (~8.8 hrs / year)
     3. Replication / automatic failover: ~99.99% (~52.6 min / year)
     4. Galera Cluster: ~99.999% (~5.3 min / year)
     5. Other strategies for high availability
  4. “An average of 80 percent of mission-critical application service downtime is directly caused by people or process failures. The other 20 percent is caused by technology failure, environmental failure or a disaster.” (Gartner Research)
  5. High Availability Components. High availability is a system design protocol and associated implementation that ensures a certain degree of operational continuity during a reference period.
     ● Data Redundancy: for stateful services, we need to make sure that data is made redundant. It is not a replacement for backups!
     ● Failover or Switchover Solution: some mechanism to redirect traffic from the failed server or data center to a working one.
     ● Monitoring and Management: availability of the services needs to be monitored in order to take action when there is a failure, or even to prevent failures.
  6. HA Terminology
  7. General Terms
     ● Single Point of Failure (SPOF)
       ○ An element is a SPOF when its failure results in a full stop of the service because no other element can take over (storage, WAN connection, replication channel).
       ○ It is important to evaluate the cost of eliminating the SPOF, the likelihood that it may fail, and the time required to bring it back into service.
     ● Downtime: the period of time a service is down, whether planned or unplanned. Planned downtime is part of the overall availability.
  8. General Terms
     ● Switchover: a manual process is used to switch from one system to a redundant or standby system in case of a failure.
     ● Failover: an automatic switchover, without human intervention.
     ● Failback: an (often underestimated) task to handle the recovery of a failed system and how to fail back to the original system after recovery.
  9. Data Redundancy: HA for MariaDB
  10. Replication Schemes
     ● Synchronous Replication: all nodes are masters and applications can read and write from/to any node.
     ● Semi-Synchronous Replication: the master does not confirm transactions to the client application until at least one slave has copied the change to its relay log and flushed it to disk.
     ● Asynchronous Replication: the master does not wait for the slave; the master writes events to its binary log and slaves request them when they are ready.
  11. HA Begins with Data Replication
     ● Replication enables data from one MariaDB server (the master) to be replicated to one or more MariaDB servers (the slaves).
     ● MariaDB Replication:
       ○ Very easy to set up:
         ● On the master: define a replication user
         ● On the slave: CHANGE MASTER TO … <options>
       ○ Used to scale out read workloads
       ○ Provides a first level of high availability and geographic redundancy
       ○ Allows offloading backups and analytic jobs to slaves
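The two-step setup above can be sketched in SQL. This is a minimal illustration, not a production checklist; the hostname, credentials, and binary log coordinates are placeholder assumptions.

```sql
-- On the master: define a dedicated replication user
-- (user name, host pattern and password are illustrative assumptions).
CREATE USER 'repl'@'%' IDENTIFIED BY 'repl_password';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';

-- On the slave: point it at the master and start replicating.
CHANGE MASTER TO
  MASTER_HOST = 'master1.example.com',
  MASTER_USER = 'repl',
  MASTER_PASSWORD = 'repl_password',
  MASTER_LOG_FILE = 'mariadb-bin.000001',
  MASTER_LOG_POS = 4;
START SLAVE;

-- Verify: Slave_IO_Running and Slave_SQL_Running should both report 'Yes'.
SHOW SLAVE STATUS\G
```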
  12. 12. Asynchronous Replication • MariaDB Replication is asynchronous by default. • Slave determines how much to read and from which point in the binary log • Slave can be behind master in reading and applying changes. – Single threaded vs parallel replication • If the master crashes, transactions might not have been transmitted to any slave • Asynchronous replication is great for read scaling as adding more replicas does not impact replication latency
  13. Asynchronous Replication: Switchover
     1. The master server is down.
     2. The slave server(s) apply up to the last position in the relay log.
     3. Determine which slave server is the most suitable to promote to master.
     4. Point the remaining slaves to the promoted server.
     5. Point applications to the new master server.
     (Diagram: master with read-only slaves, before and after promotion.)
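Steps 2-4 above can be sketched as SQL on the servers involved. This is a hand-run sketch under the assumption that the candidate slave has fully drained its relay log; the hostname is an illustrative placeholder.

```sql
-- On the candidate slave, once it has applied everything in its relay log:
STOP SLAVE;
RESET SLAVE ALL;            -- discard the old replication configuration
SET GLOBAL read_only = OFF; -- allow writes on the promoted server

-- On each remaining slave, repoint replication to the promoted server
-- (hostname is an assumption for illustration).
STOP SLAVE;
CHANGE MASTER TO MASTER_HOST = 'new-master.example.com';
START SLAVE;
```

Applications are then repointed (step 5) at the connection level, typically via a proxy or DNS change.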
  14. Async Replication Topologies (diagrams): master and slaves, master with relay slave, circular replication.
  15. MaxScale Use Case: Asynchronous Replication Failover (new in MaxScale v2.2)
     ● MariaDB Replication + read/write split routing
     ● Each application server uses only one connection (to MaxScale)
     ● MaxScale identifies the “master” and “slave” nodes
     ● If the “master” node fails, a new one can be selected and promoted
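A minimal sketch of the MaxScale 2.2 configuration behind this use case, assuming server sections `server1`-`server3` are defined elsewhere in the file and that the monitoring credentials shown are placeholders:

```ini
# Monitor that detects the master and performs automatic failover (MaxScale >= 2.2)
[Replication-Monitor]
type=monitor
module=mariadbmon
servers=server1,server2,server3
user=maxscale
password=maxscale_pw
auto_failover=true
auto_rejoin=true

# Read/write split router: writes go to the master, reads to the slaves
[Splitter-Service]
type=service
router=readwritesplit
servers=server1,server2,server3
user=maxscale
password=maxscale_pw
```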
  16. MariaDB GTID Implementation
     ● Always on since MariaDB v10.0
       ○ Compatible with non-GTID replication (binary log file and position).
     ● Allows better control of the replication chain:
       ○ The slave position is recorded crash-safe, in the same transaction as the last successful DML statement.
       ○ Does not require knowing the last binary log file name and position.
       ○ Replication will start from the last recorded GTID.
     ● Allows multi-master replication:
       ○ A single slave can have multiple incoming replication streams.
     ● GTID components:
       ○ Domain ID: identifies the logical origin of the transactions.
       ○ Server ID: identifies the server where the transaction originated.
       ○ Transaction sequence: a monotonically increasing number identifying the transaction.
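Switching an existing file-and-position slave over to GTID, and inspecting the domain-server-sequence state, looks like this in MariaDB:

```sql
-- Switch an existing slave from file/position to GTID-based replication.
STOP SLAVE;
CHANGE MASTER TO MASTER_USE_GTID = slave_pos;
START SLAVE;

-- Inspect the GTID state; each value is a list of
-- domain_id-server_id-sequence triples, one per replication domain.
SELECT @@gtid_slave_pos, @@gtid_binlog_pos, @@gtid_current_pos;
```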
  17. Semi-synchronous Replication
     ● MariaDB supports semi-synchronous replication:
       ○ The master does not confirm transactions to the client application until at least one slave has copied the change to its relay log and flushed it to disk.
       ○ Eliminates data loss by securing a copy of all transactions in at least one slave.
       ○ When a commit returns successfully, it is known that the data exists in at least two places (on the master and on at least one slave).
       ○ Semi-synchronous replication has a performance impact due to the additional round trip.
         ● It adds the network latency to the transaction processing time.
     ● One or more slaves can be defined as working semi-synchronously.
       ○ The master will downgrade to asynchronous replication after waiting for a timeout period.
       ○ Once a semi-sync slave comes back online, the master will switch back to semi-sync replication.
       ○ Status variable: Rpl_semi_sync_master_status
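Enabling semi-sync can be sketched as follows. This assumes a pre-10.3 MariaDB where semi-sync ships as a plugin; from 10.3 onward it is built in and the INSTALL SONAME steps are unnecessary. The 10-second timeout shown is the default, spelled out for illustration.

```sql
-- On the master:
INSTALL SONAME 'semisync_master';
SET GLOBAL rpl_semi_sync_master_enabled = ON;
SET GLOBAL rpl_semi_sync_master_timeout = 10000;  -- ms to wait before downgrading to async

-- On each semi-sync slave:
INSTALL SONAME 'semisync_slave';
SET GLOBAL rpl_semi_sync_slave_enabled = ON;
STOP SLAVE IO_THREAD;
START SLAVE IO_THREAD;  -- reconnect so the setting takes effect

-- Check whether the master is currently running semi-synchronously:
SHOW STATUS LIKE 'Rpl_semi_sync_master_status';
```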
  18. Semi-Sync Replication Topologies
     ● Semi-synchronous replication is used between the master and a backup master.
     ● Semi-sync replication has a performance impact, but the risk of data loss is minimized.
     ● This topology works well when performing master failover:
       ○ The backup master acts as a warm standby server; it has the highest probability of having up-to-date data compared to the other slaves.
     (Diagram: semi-sync to the read-only backup master, asynchronous to the read-only slaves.)
  19. MariaDB Multi-Source Replication
     ● Enables a slave to receive transactions from multiple sources simultaneously.
     ● Can be used to back up multiple servers to a single server, to merge table shards, and to consolidate data from multiple servers into a single server.
     ● GTID helps track transactions coming from different servers / applications.
     ● Note: there is no conflict resolution; the last DML to reach the slave “wins”.
     (Diagram: Master 1, Master 2 and Master 3 all replicating into one slave.)
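MariaDB implements this with named replication connections, one per source. A sketch, with hostnames and credentials as illustrative assumptions:

```sql
-- One named replication connection per master.
CHANGE MASTER 'master1' TO
  MASTER_HOST = 'master1.example.com',
  MASTER_USER = 'repl', MASTER_PASSWORD = 'repl_password',
  MASTER_USE_GTID = slave_pos;

CHANGE MASTER 'master2' TO
  MASTER_HOST = 'master2.example.com',
  MASTER_USER = 'repl', MASTER_PASSWORD = 'repl_password',
  MASTER_USE_GTID = slave_pos;

-- Start and inspect every connection at once.
START ALL SLAVES;
SHOW ALL SLAVES STATUS\G
```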
  20. Combining MariaDB Replication Features
     ● Replication features can be combined to form more resilient configurations.
     ● Example:
       ○ Implement semi-sync circular replication to increase data resilience.
       ○ Use GTID to avoid duplicate transactions.
       ○ Use read-only slaves for read scale-out.
       ○ Use MaxScale:
         ● Transactions go to the active master.
         ● Reads are offloaded to the slaves.
         ● Fast failover.
       ○ Writes go to a single master at any given time.
     (Diagram: semi-sync to the backup master, asynchronous to the read-only slaves.)
  21. Synchronous Replication (Galera)
     ● Galera Replication is a synchronous multi-master replication plug-in that enables a true master-master setup for InnoDB.
     ● Every component of the cluster (node) is a shared-nothing server.
     ● All nodes are masters and applications can read and write from any node.
       ○ NOTE: no conflict resolution.
     ● A minimal Galera cluster consists of 3 nodes:
       ○ A proper cluster needs to reach a quorum (i.e. a majority of the nodes of the cluster).
     ● Transactions are synchronously committed on all nodes.
     (Diagram: three MariaDB Galera nodes.)
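A minimal sketch of the Galera-related server settings, assuming a three-node cluster; node names, addresses, and the provider path are placeholders that vary by distribution:

```ini
# Galera settings in my.cnf / server.cnf (names and paths are assumptions)
[galera]
wsrep_on=ON
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_cluster_name=my_cluster
wsrep_cluster_address=gcomm://node1,node2,node3
wsrep_node_name=node1
wsrep_node_address=node1

# Galera requires row-based replication and InnoDB
binlog_format=ROW
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
```

The very first node is bootstrapped (e.g. with `galera_new_cluster` on systemd installations) so that it forms the initial cluster the other nodes join.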
  22. MaxScale Use Case: MDBE Cluster Synchronous Replication
     ● Galera Cluster + read/write split routing
     ● Each application server uses only one connection (to MaxScale)
     ● MaxScale selects one node as “master” and the other nodes as “slaves”
     ● If the “master” node fails, a new one can be elected immediately
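The MaxScale side of this topology swaps the replication monitor for the Galera monitor; the rest mirrors the earlier read/write split setup. Server names and credentials are again placeholders:

```ini
# Monitor for Galera clusters: picks one synced node as the "master"
[Galera-Monitor]
type=monitor
module=galeramon
servers=node1,node2,node3
user=maxscale
password=maxscale_pw

# Same read/write split router as in the async-replication use case
[Splitter-Service]
type=service
router=readwritesplit
servers=node1,node2,node3
user=maxscale
password=maxscale_pw
```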
  23. Use Cases: MariaDB Galera Cluster in Real Life
  24. Geographically Distributed Cluster
     ● International ferry services company located in Germany
     ● Active in different regions of the world
     ● Booking system for ferry services, with instances for the Americas and EMEA
     ● Bookings should be available for transfer to SAP immediately for invoicing
     ● Disaster backup needed in case the data center in the US fails
  25. Geographically Distributed Cluster
     ● Travel agencies all over the Americas can book 24x7.
     ● Synchronization is fast enough not to produce a noticeable delay for the user.
     ● Cluster performance depends on:
       ○ size of write sets
       ○ bandwidth between nodes
     ● Three-node cluster:
       ○ two nodes for agencies to book against
       ○ one node for transfer to SAP at headquarters
       ○ the headquarters node also serves as disaster backup if the US data center fails completely
     ● Similar setup for the EMEA region.
     (Diagram: Segment 1 United States, Segment 1 Germany, SAP.)
     Certification process:
     ● local COMMIT initiated
     ● write set generated on the active node
     ● write set sent to all other nodes
     ● wait for all nodes to acknowledge
     ● local COMMIT executed
  26. Production Control in Medical Engineering
     ● In medical engineering it is required that every part of a medical device can be tracked throughout its entire lifecycle.
     ● Data to be gathered throughout production:
       ○ part details (timestamp of production, part and serial number, tolerances, etc.)
       ○ parts list per assembled device
       ○ test results
       ○ error rate
     ● Medical devices have very long lifecycles.
     ● An open source database with an open data format is the best guarantee that the data can still be accessed even in 50 years.
  27. Production Control in Medical Engineering
     ● In medical engineering it is required that every part of a medical device can be tracked throughout its entire lifecycle.
     ● Data to be gathered throughout production:
       ○ part details (timestamp of production, part and serial number, tolerances, etc.)
       ○ parts list per assembled device
       ○ test results
       ○ error rate
     ● Continuous production requires a zero-downtime infrastructure.
     ● Replacement for an Oracle RAC cluster, retiring the old solution.
  28. Production Control in Medical Engineering
     ● Control production lines for medical technology.
     ● Gather production details for all parts and assembled medical devices.
     ● Generate production and quality reports.
     ● Provide detailed assembly data throughout the life cycle of each individual device.
     ● Raise alerts in case:
       ○ tolerances are not met
       ○ the error rate rises
       ○ a production line fails
     (Diagram: three MariaDB Galera nodes.)
  29. Thank you
