Building self-managing Couchbase clusters in eBay's cloud – Connect Silicon Valley 2017

Speaker: Shenglin Du, eBay

eBay has one of the largest Couchbase deployments in the world. This session covers how to manage and maintain thousands of Couchbase nodes in eBay’s cloud platform. We are building a self-healing, self-managing NoSQL database ecosystem that can cover and fix over 95% of database issues without human intervention. The session also shares lessons learned from using Couchbase at eBay.

Published in: Technology

  1. Building Self-Managing Couchbase Cluster in eBay Cloud • Couchbase Connect 2017
  2. Shenglin Du – eBay Database Engineer • Has worked as an Oracle/MySQL DBA since 2006 • Has worked on Cassandra, MongoDB and Couchbase since 2013 • Building a Self-Managing NoSQL Ecosystem in eBay Cloud
  3. Different Databases Serve Different Purposes • RDBMS: Oracle, MySQL • NoSQL: Couchbase, MongoDB, Cassandra, Memcached • Platform: OpenStack Private Cloud Platform, Docker and Kubernetes
  4. Couchbase in eBay • Over 100 clusters • A few thousand nodes • Three datacenters with XDCR • Mostly on versions 3.1.6 and 4.6.2 • Running on VMs
  5. How to Manage Couchbase Clusters • DBLM – Database Lifecycle Manager • UI/CLI/API • Provision/Add/Replace/Decommission… • Dashboard • Auto Failure Detection and Remediation • Built in 2016 • Fixes over 15 different kinds of errors • Auto Tech-refresh • OS and Couchbase Upgrade • Keep adding new features
  6. Couchbase Self-Remediation Architecture
  7. Couchbase Error Check List • Cluster Level: Pending Rebalance, Unbalance Group Nodes, Node Not in Cluster, autofailover_quota - reset quota • Node Level: Host Down, Couchbase Down, Pending Recovery/Warmup, Free Memory Low, High Disk Usage, write_commit_failure
  8. Couchbase Error Check List (cont.) • Bucket Level: High XDCR Mutation, Item Count Mismatch, Paused XDCR • Important Metrics to Monitor: Cache Miss Ratio, Active Docs Resident, Disk Write Queue, Doc Fragmentation, Background Fetch Wait Time • Keep adding more
  9. Auto Remediation Considerations • Error Remediation Priority • “Host Down” has higher priority than “Pending Rebalance” • “Not in Cluster” has higher priority than “Unbalance Group” • XDCR Mutation has low priority • Error Remediation Concurrency • Concurrent execution across different clusters • Concurrent execution in the same cluster • Sequential execution and concurrency control • Exclusive List and Threshold • Recovered Without Remediation • Retry and Exception Handling
  10. Example – Fix Host Down • Check if the host is still needed • Host is recovered • Check if the node is warming up • Check if the node is pending recovery • Check if a rebalance is needed in the cluster • Host needs to be replaced • Get host attributes, such as version, flavor, group, service ... • Provision new host in Cloud • Install Couchbase binary and set up • Add new node, remove/fail over the dead node • Rebuild index node • Rebalance the cluster
  11. Self-remediation Success Rate
  12. Couchbase Learning Experience • Lack of always-available writes • Application option to write to remote DC when local write fails • Cross-DC update conflict resolution • Upgrade to the latest stable version to avoid bugs • Metadata memory overhead • 56 bytes of metadata is too much if you have a small key • Memory fragmentation • CB 4.x replaces TCMalloc with jemalloc • Slow warm-up • Remove access log to speed up warm-up • The 10-bucket limit does not work well for a shared QA env
  13. Couchbase Learning Experience (cont.) • Build a New Cluster for Upgrade • Use XDCR to sync up data and then cut over • Avoid too many node replacements • Scale out instead of scale up • Shard using multiple small clusters • Different groups in different fault domains • Different HV for the hosts in the same group
  14. Rebalance Issue • Use swap rebalance when applicable • View indexing and querying cause trouble with rebalance - MB-17025 • Pause the incoming/outgoing XDCR • Sleep a few minutes and retry rebalance • Check Log • Rebalance exited with reason {not_all_nodes_are_ready_yet… • Increase nonio worker threads • Monitor using cbstats dispatcher • Check nonio thread setting using cbstats config | grep nonio • Need to restart memcached on all nodes • MB-19837 Increase default number of ep-engine nonIO threads in 4.5.1
  15. XDCR Issue • Some nodes in the cluster have high outbound XDCR Mutation • MB-19832 XDCR going into infinite loop when node joins cluster, fixed in 3.1.6 • MB-19697 more than one replication instances may be started when replication is resumed • That may not be the real XDCR lag • Check Item Count instead of XDCR mutation • Pause and resume XDCR • Recreate XDCR and its reference • Restart Cluster Manager • Restart Node • Ensure all nodes on the source and target cluster have the same version
  16. What We Plan to Do • VMs -> Docker • DBaaS – Database as a Service
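
The remediation priorities on slide 9 could be sketched roughly as below. This is a minimal illustration, not DBLM's actual code: the `PRIORITY` numbers, the `DetectedError` type, and `plan_remediation` are hypothetical names, and only the two stated orderings ("Host Down" before "Pending Rebalance", "Not in Cluster" before "Unbalance Group") plus the low priority of XDCR mutation come from the slides. The sketch also assumes the "Exclusive List" means clusters that are skipped by auto-remediation.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical priority table: a lower number is remediated first.
# Only the relative order of the pairs named on slide 9 comes from the talk;
# the numeric values and the order between unrelated errors are assumptions.
PRIORITY = {
    "Host Down": 0,
    "Node Not in Cluster": 1,
    "Pending Rebalance": 2,
    "Unbalance Group Nodes": 3,
    "High XDCR Mutation": 9,  # slide 9: XDCR mutation has low priority
}
DEFAULT_PRIORITY = 5  # assumption for error kinds not listed above


@dataclass(frozen=True)
class DetectedError:
    cluster: str
    node: str
    kind: str


def plan_remediation(errors, exclusive_clusters=frozenset()):
    """Group errors per cluster and order each group by priority.

    Clusters in `exclusive_clusters` are skipped (assumed meaning of the
    "Exclusive List"). Each per-cluster queue is meant to be executed
    sequentially, while different clusters can be handled concurrently.
    """
    plan = defaultdict(list)
    for err in errors:
        if err.cluster in exclusive_clusters:
            continue
        plan[err.cluster].append(err)
    for queue in plan.values():
        # list.sort is stable, so equal-priority errors keep detection order.
        queue.sort(key=lambda e: PRIORITY.get(e.kind, DEFAULT_PRIORITY))
    return dict(plan)


if __name__ == "__main__":
    detected = [
        DetectedError("c1", "n1", "Pending Rebalance"),
        DetectedError("c1", "n2", "Host Down"),
    ]
    for cluster, queue in plan_remediation(detected).items():
        print(cluster, [e.kind for e in queue])
```

Keeping one ordered queue per cluster is one way to get sequential execution inside a cluster while still allowing concurrent work across clusters; a real system would also need the retry and exception handling the slide mentions.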

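To see why slide 12 calls 56 bytes of metadata "too much" for small keys, a back-of-the-envelope estimate helps. The sketch below assumes every active and replica copy of an item keeps its key plus 56 bytes of metadata in memory; the function name, the replica handling, and the sample numbers are illustrative assumptions, not figures from the talk.

```python
METADATA_BYTES_PER_ITEM = 56  # per-item figure quoted on slide 12


def metadata_overhead(item_count, avg_key_bytes, replicas=1):
    """Return (metadata_bytes, key_bytes, metadata_share).

    Assumes each active and replica copy holds the key and its metadata
    in memory. Document values are deliberately left out of the estimate.
    """
    copies = 1 + replicas
    metadata_bytes = item_count * METADATA_BYTES_PER_ITEM * copies
    key_bytes = item_count * avg_key_bytes * copies
    share = metadata_bytes / (metadata_bytes + key_bytes)
    return metadata_bytes, key_bytes, share


if __name__ == "__main__":
    # Hypothetical workload: one million items with 16-byte keys, one replica.
    meta, keys, share = metadata_overhead(1_000_000, 16, replicas=1)
    print(f"metadata: {meta} B, keys: {keys} B, metadata share: {share:.0%}")
```

With 16-byte keys the fixed metadata is several times larger than the keys themselves, which is the effect the slide points at.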