Maximizing high availability for your cluster – Connect Silicon Valley 2017

Speakers: Mel Boulos, Michael Nitschinger

For today’s mission-critical web, mobile, and IoT applications, high availability is not a “nice to have” feature; it is an essential requirement. Downtime and degraded performance are unacceptable because they result in unhappy customers and lost revenue.

Join this demo-filled session to learn how to deliver continuously available mission-critical applications across data centers using Couchbase Server and the brand-new multi-cluster-aware SDK extension. We will cover a wide array of high availability and disaster recovery features available (especially those introduced in Couchbase Server 5.0) and how to leverage them to support highly reliable, responsive applications that result in excellent customer experiences.

Published in: Technology

  1. MAXIMIZING HIGH AVAILABILITY
     Mel Boulos, Couchbase; Poonam Dhavale, Couchbase; Michael Nitschinger, Couchbase
     Confidential and Proprietary. Do not distribute without Couchbase consent. © Couchbase 2017. All rights reserved.
  2. AGENDA
     01/ High Availability
     02/ Disaster Recovery
     03/ 5.0 Spotlight: Fast Failover
     04/ Highly Available Applications
  3. Section 1: HIGH AVAILABILITY
  4. Couchbase Server – Single Node Architecture
     • Single node type is the foundation for a high availability architecture
     • No Single Point of Failure (SPOF)
     • Easy Scalability
     [Diagram: two Couchbase Server nodes, each showing a Cluster Manager, Data Service, Index Service, Query Service, Managed Cache, and Storage holding shards]
  5. Intra-Cluster Replication – Data Redundancy
     Intra-cluster replication is the process of replicating data on multiple servers within a cluster in order to provide data redundancy.
     • RAM to RAM replication
     • Max of 4 copies of data in a cluster
     • Bandwidth optimized by de-duplicating (“de-duping”) items
  6. Rack-Zone Awareness
     • Informs Couchbase as to the physical distribution of nodes in racks or availability groups.
     • Ensures that active and replica vBuckets are distributed across groups
     • Servers 1, 2, 3 on Rack 1
     • Servers 4, 5, 6 on Rack 2
     • Servers 7, 8, 9 on Rack 3
     • Cluster has 2 replicas (3 copies of data)
     • This is a balanced configuration
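
The rack layout on this slide can be illustrated with a minimal sketch. It assumes a simple round-robin placement that puts each copy of a vBucket in a different group; `place_copies` and `GROUPS` are hypothetical names, and this is not Couchbase's actual placement algorithm.

```python
from typing import Dict, List

# Server groups as listed on the slide.
GROUPS: Dict[str, List[str]] = {
    "Rack 1": ["Server 1", "Server 2", "Server 3"],
    "Rack 2": ["Server 4", "Server 5", "Server 6"],
    "Rack 3": ["Server 7", "Server 8", "Server 9"],
}


def place_copies(groups: Dict[str, List[str]], vbucket: int, replicas: int) -> List[str]:
    """Return the nodes holding [active, replica 1, ...], one node per group.

    Illustrative round-robin placement only (an assumption, not the product's logic).
    """
    names = sorted(groups)
    if replicas + 1 > len(names):
        raise ValueError("need at least replicas + 1 groups to separate every copy")
    placement = []
    for copy in range(replicas + 1):
        # Each successive copy moves to the next group, so no group holds two copies.
        group = names[(vbucket + copy) % len(names)]
        nodes = groups[group]
        placement.append(nodes[(vbucket // len(names)) % len(nodes)])
    return placement
```

With 2 replicas and 3 racks, every vBucket ends up with one copy per rack, so losing a whole rack still leaves two copies of every vBucket.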
  7. Rack-Zone Awareness
     • If an entire rack fails, data is still available
     • If an entire cloud availability zone fails, data is still available
  8. Automatic Failover – “In action”
     • App servers access shards
     • Requests to Server 3 fail
     • Cluster detects that the server has failed
     • Promotes replicas of shards to active
     • Updates cluster map
     • Requests for docs now go to the appropriate server
     • Typically a rebalance would follow
     [Diagram: Servers 1 to 5 holding active and replica shards; application servers reach them through the Couchbase client library and its cluster map]
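
The failover steps on this slide can be sketched as follows. This is a minimal sketch that assumes the cluster map is a plain dictionary from shard number to an ordered list of nodes (active first, then replicas). `fail_over` is a hypothetical helper, not a Couchbase API, and the sample map is illustrative.

```python
from typing import Dict, List

# shard -> [active node, replica node, ...]; an illustrative map.
ClusterMap = Dict[int, List[str]]

SAMPLE_MAP: ClusterMap = {
    1: ["Server 3", "Server 1"],
    3: ["Server 3", "Server 2"],
    6: ["Server 3", "Server 2"],
    5: ["Server 1", "Server 3"],
}


def fail_over(cluster_map: ClusterMap, failed: str) -> ClusterMap:
    """Return an updated cluster map with `failed` removed from every shard.

    Because each list is ordered active-first, dropping a failed active node
    promotes the first surviving replica to active.
    """
    updated: ClusterMap = {}
    for shard, nodes in cluster_map.items():
        survivors = [node for node in nodes if node != failed]
        if not survivors:
            raise RuntimeError(f"shard {shard} has no surviving copy")
        updated[shard] = survivors
    return updated


def route(cluster_map: ClusterMap, shard: int) -> str:
    """Return the node a client should send requests for `shard` to."""
    return cluster_map[shard][0]
```

After `fail_over(SAMPLE_MAP, "Server 3")`, requests for shard 1 route to Server 1 and requests for shards 3 and 6 route to Server 2. The shards now have fewer copies than before, which is why a rebalance typically follows.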
  9. Node Recovery – Bring Cluster back to Capacity
     • A failed-over node can be added back to the cluster
     • Full recovery – Add back as a fresh node
     • Delta node recovery – Add the failed node back into the cluster incrementally, without having to rebuild the full node.
  10. Section 2: DISASTER RECOVERY
  11. Cross Datacenter Replication (XDCR): Unidirectional or Bidirectional Replication
      Unidirectional
      • Hot spare / Disaster Recovery
      • Development/Testing copies
      • Connectors (Solr, Elasticsearch)
      • Integrate to custom consumer
      Bidirectional
      • Multiple Active Masters
      • Disaster Recovery
      • Datacenter Locality
  12. Cross Datacenter Replication (XDCR)
      • Continuously replicates data from a source cluster to remote clusters, which may be spread across geographies
      • Supports unidirectional and bidirectional operation
      • Application can read and write from both clusters (active – active replication)
      • Automatically handles node addition and removal
      • Simplified administration via Admin UI, REST, and CLI
      • Pause and resume XDCR replication
      • (NEW in 4.0) Filtering of data on replication stream
  13. XDCR after Write
      [Diagram labels: Application Server, Managed Cache, Replication Queue (to other node), Disk Queue, Disk, XDCR Queue (to other cluster), DOC 1]
  14. XDCR Conflict Resolution Modes
      • Revision-based Conflict Resolution [Default]: Current XDCR conflict resolution uses the revision ID. Revision IDs keep track of the number of mutations to a key, so this mode is best characterized as “the most updates wins”.
      • Timestamp-based Conflict Resolution [New]: Timestamp-based conflict resolution uses the hybrid logical clock. The timestamp has both physical time (NTP) and a logical counter. This mode is also known as Last Write Wins (LWW) and is best characterized as “the most recent update wins”.
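
The difference between the two modes can be shown with a minimal sketch. The `Doc` type and both functions are hypothetical; tie-breaking is not covered on the slide, so as an assumption the sketch leaves ties with the target document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    value: str
    rev_id: int     # number of mutations applied to the key
    timestamp: int  # hybrid logical clock value of the last write


def resolve_by_revision(source: Doc, target: Doc) -> Doc:
    """'The most updates wins': keep the document with more mutations."""
    return source if source.rev_id > target.rev_id else target


def resolve_by_timestamp(source: Doc, target: Doc) -> Doc:
    """'The most recent update wins' (LWW): keep the later write."""
    return source if source.timestamp > target.timestamp else target


# One side was updated five times long ago, the other twice but more recently.
often_updated = Doc(value="A", rev_id=5, timestamp=100)
recently_updated = Doc(value="B", rev_id=2, timestamp=200)
```

For this pair the two modes disagree: `resolve_by_revision` keeps `often_updated`, while `resolve_by_timestamp` keeps `recently_updated`.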
  15. What is a Hybrid Logical Clock?
      • A Hybrid Logical Clock is a combination of physical time and a logical counter.
      • A Hybrid Logical Clock is represented as a 64-bit integer
      • First 48 bits – physical time
      • Last 16 bits – logical counter
      • The Hybrid Logical Clock is stored in the CAS
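
The 48/16-bit layout can be sketched with plain integer arithmetic. This sketch assumes the physical time is an opaque 48-bit integer (the slide does not state its unit). The helper names are hypothetical, and `next_timestamp` follows the generic hybrid-logical-clock rule for a local write rather than any confirmed Couchbase implementation.

```python
from typing import Tuple

PHYSICAL_BITS = 48
LOGICAL_BITS = 16
LOGICAL_MASK = (1 << LOGICAL_BITS) - 1


def pack(physical: int, logical: int) -> int:
    """Combine physical time (high 48 bits) and counter (low 16 bits)."""
    if not 0 <= physical < (1 << PHYSICAL_BITS):
        raise ValueError("physical time does not fit in 48 bits")
    if not 0 <= logical <= LOGICAL_MASK:
        raise ValueError("logical counter does not fit in 16 bits")
    return (physical << LOGICAL_BITS) | logical


def unpack(timestamp: int) -> Tuple[int, int]:
    """Split a 64-bit timestamp into (physical, logical)."""
    return timestamp >> LOGICAL_BITS, timestamp & LOGICAL_MASK


def next_timestamp(last: int, physical_now: int) -> int:
    """Return a timestamp strictly greater than `last`.

    If the wall clock has moved forward, use it and reset the counter;
    otherwise keep the old physical part and bump the logical counter.
    """
    last_physical, last_logical = unpack(last)
    if physical_now > last_physical:
        return pack(physical_now, 0)
    return pack(last_physical, last_logical + 1)
```

Because the physical time occupies the high bits, comparing two packed values as ordinary integers orders them by physical time first and by the logical counter second.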
  16. High Availability – Cross-Cluster FailOver & FailBack
      • Applications can automatically switch traffic over to another cluster for high availability.
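
Application-side switch-over can be sketched generically. This is a minimal sketch in which each cluster is a plain callable that returns a value or raises the hypothetical `ClusterDown` exception; `FailoverReader` is an illustrative name and not the API of the multi-cluster-aware SDK extension. Fail-back is not shown.

```python
from typing import Callable, List


class ClusterDown(Exception):
    """Raised by a cluster callable when the cluster cannot serve a request."""


class FailoverReader:
    """Send reads to the current cluster and switch to the next one on failure."""

    def __init__(self, clusters: List[Callable[[str], str]]) -> None:
        if not clusters:
            raise ValueError("at least one cluster is required")
        self._clusters = clusters
        self.active = 0  # index of the cluster currently receiving traffic

    def get(self, key: str) -> str:
        # Try each cluster at most once, starting with the active one.
        for _ in range(len(self._clusters)):
            try:
                return self._clusters[self.active](key)
            except ClusterDown:
                self.active = (self.active + 1) % len(self._clusters)
        raise ClusterDown("no cluster reachable")
```

The reader stays on the cluster it switched to, so later requests do not pay for retrying the failed cluster first.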
  17. Efficient Recovery with Incremental Backup & Restore
      • Back up only the data updated since the last backup
      • Differential Backups
      • Cumulative Backups
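
The two incremental styles can be contrasted with a minimal sketch. It assumes the usual definitions, which the slide does not spell out: a differential backup holds changes since the last backup of any kind, and a cumulative backup holds changes since the last full backup. The function names and the sequence-number model are hypothetical.

```python
from typing import Dict, Set


def changed_since(last_seqno_by_key: Dict[str, int], seqno: int) -> Set[str]:
    """Return the keys whose latest mutation is newer than `seqno`."""
    return {key for key, s in last_seqno_by_key.items() if s > seqno}


def differential_backup(last_seqno_by_key: Dict[str, int], last_backup_seqno: int) -> Set[str]:
    """Keys changed since the most recent backup of any kind (assumed definition)."""
    return changed_since(last_seqno_by_key, last_backup_seqno)


def cumulative_backup(last_seqno_by_key: Dict[str, int], last_full_seqno: int) -> Set[str]:
    """Keys changed since the most recent full backup (assumed definition)."""
    return changed_since(last_seqno_by_key, last_full_seqno)


# Latest mutation sequence number per key; full backup taken at seqno 10,
# an incremental backup taken at seqno 20.
MUTATIONS = {"a": 5, "b": 12, "c": 25}
```

Here a differential backup contains only `c`, while a cumulative backup contains `b` and `c`. The trade-off is smaller backups against fewer files to replay on restore.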
  18. Section 3: 5.0 SPOTLIGHT: FAST FAILOVER
  19. Auto-Failover Quick Refresh
      • High Level Steps:
        1. Failure Detection: Orchestrator detects a node is unhealthy.
        2. Checks whether auto-failover is possible.
        3. Fails it over, if possible.
      • Disabled by default.
      • Failover Timeout: Minimum time to confirm a node is unhealthy.
        • Prevent spurious failovers.
        • Default 120s.
        • Pre-5.0: Cannot be set to lower than 30s.
  20. Fast Failover in 5.0
      • Reduced minimum auto-failover timeout: ~5s (Enterprise Edition)
        • Community Edition: 30s.
      • No change in defaults: Auto-failover disabled by default. Default timeout 120s.
      • New robust failure detector.
      • New Reason field in the auto-failover log message.
        • E.g. the cluster manager on nodeA goes down. The following message will be displayed when nodeA is failed over:
          “Node (nodeA) was automatically failed over. Reason: The cluster manager did not respond for the duration of the auto-failover threshold.”
      • New alert generated when a subset of nodes in the cluster encounters communication issues.
  21. Orchestrator Auto-Failover
      • For certain failures, auto-failover of the orchestrator node can take an additional ~60s over the auto-failover timeout.
      • This behavior is the same as in prior releases.
      • If the failure is due to a data service issue, then orchestrator failover is as fast as for any other node in the cluster.
      • Failover time for the orchestrator will be improved in the next release.
  22. Robust Failure Detector in 5.0
      • Considers view of all nodes in the cluster and not just that of the orchestrator.
      • Determines health of a node using information from multiple communication channels:
        • Cluster manager heartbeats
        • Intra-cluster replication traffic
      • Determines health of a bucket using information from multiple sources:
        • Intra-cluster replication traffic
        • Warmed buckets information from data service
  23. Robust Failure Detector Design
      [Diagram labels: Servers 0, 1, and 2, each with a Cluster Manager, Data Service, Node Monitor, Cluster Manager Monitor, and Data Service Monitor; Cluster Manager Heartbeats and Intra Cluster Replication between servers; Monitor Status feeding the Node Status Analyzer and Auto-Failover on the Orchestrator]
  24. Node Status Analyzer & Failover Logic
      • Node Status Analyzer takes the view of all monitors on all nodes to determine the status of a node, say NodeA.

      NodeA Status          | Condition                                                                                       | Action after auto-failover timeout
      Healthy               | All monitors on all nodes report NodeA is active.                                               | None required.
      Unhealthy             | All monitors on all nodes report NodeA is inactive.                                             | Failover
      Data Service Issue    | Data service monitor on all nodes report one or more buckets on NodeA are inactive.             | Failover
      Cluster Manager Issue | Cluster manager monitor on all nodes report NodeA is inactive.                                  | Failover
      Network Issue         | Cluster manager monitor on some nodes report NodeA is inactive while others report it as active. | No Failover. Generate an alert.
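
The decision table can be expressed as a small function. This is a minimal sketch in which each observing node contributes one `Report` about NodeA. All names are hypothetical. The order in which overlapping conditions are checked and the extra `UNDETERMINED` outcome are assumptions, since the slide does not cover those cases.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class Status(Enum):
    HEALTHY = "Healthy"
    UNHEALTHY = "Unhealthy"
    DATA_SERVICE_ISSUE = "Data Service Issue"
    CLUSTER_MANAGER_ISSUE = "Cluster Manager Issue"
    NETWORK_ISSUE = "Network Issue"
    UNDETERMINED = "Undetermined"  # assumption: not a row in the slide's table


@dataclass(frozen=True)
class Report:
    """What one observing node's monitors say about NodeA."""
    cluster_manager_active: bool
    buckets_active: bool


FAILOVER_STATUSES = {
    Status.UNHEALTHY,
    Status.DATA_SERVICE_ISSUE,
    Status.CLUSTER_MANAGER_ISSUE,
}


def analyze(reports: List[Report]) -> Tuple[Status, str]:
    """Map the monitors' reports to a status and the action after the timeout."""
    if not reports:
        raise ValueError("at least one report is required")
    cm_down = [not r.cluster_manager_active for r in reports]
    ds_down = [not r.buckets_active for r in reports]

    if all(cm_down) and all(ds_down):
        status = Status.UNHEALTHY
    elif all(ds_down):
        status = Status.DATA_SERVICE_ISSUE
    elif all(cm_down):
        status = Status.CLUSTER_MANAGER_ISSUE
    elif any(cm_down):
        status = Status.NETWORK_ISSUE  # observers disagree about NodeA
    elif not any(ds_down):
        status = Status.HEALTHY
    else:
        status = Status.UNDETERMINED

    if status in FAILOVER_STATUSES:
        return status, "failover"
    if status is Status.NETWORK_ISSUE:
        return status, "no failover, generate an alert"
    return status, "none"
```

The key design point from the slide is the last table row: when observers disagree, the likely cause is the network rather than NodeA itself, so the analyzer raises an alert instead of failing the node over.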
  25. Section 4: HIGHLY AVAILABLE APPLICATIONS
  26. Application View – Single Cluster
  27. Application View – Multiple Clusters
  28. Application View – Traffic Distribution
  29. Application View – Cluster Failover
  30. Application View – XDCR
  31. Application View – XDCR Ring
  32. Application View – Java MCA Architecture
  33. DEMO
  34. THANK YOU