Best Practices: Troubleshooting Your Couchbase Application: Couchbase Connect 2015

From operation timeouts to node failures, what happens when things go wrong with your Couchbase application? Drawing on our experience with real-world customer cases, we’ll share some of our war stories on what can happen to Couchbase deployments, the steps we took to troubleshoot them, and how the problems could have been avoided.


  1. 1. TROUBLESHOOTING COUCHBASE
        David Haikney, Head of Technical Support, Couchbase EMEA
  2. 2. ©2015 Couchbase Inc. What Can Possibly Go Wrong?
        • Help! My Node is Down
        • Why Did My Operation Fail?
        • Some Handy Tools
        The following presentation is based on actual events…
  3. 3. Help! My Node is Down
  4. 4. Why Does A Node Go Down?... [Diagram: a four-node cluster, Node 1 to Node 4]
  5. 5. …Because Heartbeats Have Gone Missing [Diagram: a four-node cluster, Node 1 to Node 4]
  6. 6. 04 15:
  7. 7. 1. Server Itself is Down
        • Server (or VM) going offline takes Couchbase with it
        • Occurs during planned or “unscheduled” maintenance
        • What matters is server status at the time of the event
        Troubleshooting Tips!
        • (VMs) check status in management console
        • Check server’s uptime
        • Couchbase’s own uptime: cbstats | grep uptime
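
The uptime comparison on the slide above can be sketched in a few lines of Python. This is a minimal sketch, assuming the `cbstats` output has been captured as text in its usual `name: value` layout; the sample values, the function names and the 300-second slack are all illustrative assumptions, not part of the presentation.

```python
def parse_stats(text):
    """Turn 'name: value' lines (the layout cbstats prints) into a dict."""
    stats = {}
    for line in text.splitlines():
        # Split on the first colon only, so values containing ':' survive.
        name, sep, value = line.partition(":")
        if sep:
            stats[name.strip()] = value.strip()
    return stats


def restarted_without_reboot(couchbase_uptime, server_uptime, slack=300):
    """True if the service has been up noticeably less time than the host.

    Both arguments are in seconds. The slack (an assumed value) allows for
    the normal gap between the host booting and the service starting.
    """
    return couchbase_uptime + slack < server_uptime


# Hypothetical captured output; real cbstats output has many more lines.
SAMPLE = """\
 threads:        4
 uptime:         5400
"""

stats = parse_stats(SAMPLE)
cb_uptime = int(stats["uptime"])
```

If the server has been up for ten days but Couchbase only for ninety minutes, `restarted_without_reboot(cb_uptime, 864000)` returns `True`, which points towards the service rather than the server.
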
  8. 8. 2. Server is Unreachable
        • Server unavailable on network
        • Usually related to wider datacenter event
        • Our assailant may have fled the scene!
        Troubleshooting Tips!
        • Verify connectivity from other cluster nodes (ping/ssh)
        • Check system and network logs around time of first report
        • Standard network monitoring should be deployed
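
Alongside ping and ssh, a TCP-level check from another cluster node can confirm whether a port on the suspect server is reachable at all. The sketch below uses only the Python standard library; the default port 11210 is the one named later in the deck, while the function names and timeout are assumptions made for illustration.

```python
import socket


def can_connect(host, port=11210, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable hosts.
        return False


def unreachable_peers(hosts, port=11210):
    """Return the subset of hosts that could not be reached on the port."""
    return [host for host in hosts if not can_connect(host, port)]
```

Run from each of the other nodes in turn, `unreachable_peers([...])` shows whether the server is unreachable from everywhere or only from part of the network.
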
  9. 9. 3. Couchbase Service is Offline
        • Server is available but Couchbase service is not running
        • Relatively rare since babysitter restarts failed processes
        Troubleshooting Tips!
        • Check which Couchbase processes are running
        • Check dmesg for Linux’s OOM Killer:
          May 14 04:26:49 cgs1 kernel: memcached invoked oom-killer
        • May need to reduce the server quota
        • Attempt to restart the service
        • Possibly warming up - monitor for progress with cbstats
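
Scanning kernel log text for OOM Killer activity is easy to script. A minimal sketch follows: the first log line is the one quoted on the slide, the second is a made-up line added to show a non-match, and the function name is an assumption.

```python
import re

# Matches the kernel message format shown on the slide:
# "<process> invoked oom-killer"
OOM_PATTERN = re.compile(r"(?P<process>\S+) invoked oom-killer")


def find_oom_events(lines):
    """Return (process, full_line) for every line reporting the OOM Killer."""
    events = []
    for line in lines:
        match = OOM_PATTERN.search(line)
        if match:
            events.append((match.group("process"), line.strip()))
    return events


log = [
    "May 14 04:26:49 cgs1 kernel: memcached invoked oom-killer",
    "May 14 04:26:50 cgs1 kernel: unrelated message",  # hypothetical line
]
events = find_oom_events(log)
```

Here `events` holds a single entry naming `memcached` as the process that triggered the OOM Killer; in practice the lines would come from `dmesg` output or the system log.
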
  10. 10. 4. Couchbase Too Slow to Respond
          • Service is running but failed to send timely heartbeats
          • Common trigger of autofailover
          • THP, Sizing, Swap, Hypervisor Disturbance
          Troubleshooting Tips!
          • Transparent Huge Pages (THP) disabled on Linux systems
          • Follow Sizing Best Practice
          • Check for system swap usage
          • Avoid VM over-commit or migration
          • Increase the autofailover timeout
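
Two of the checks above (THP and swap) can be read straight from Linux pseudo-files. The sketch below parses their text format; it assumes the mainline kernel path `/sys/kernel/mm/transparent_hugepage/enabled` (some distributions use a different path) and `/proc/meminfo`, and the function names are illustrative.

```python
def active_thp_mode(text):
    """Return the bracketed value from text such as 'always madvise [never]'."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


def swap_used_kb(meminfo_text):
    """Return SwapTotal minus SwapFree (in kB) from /proc/meminfo text."""
    values = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name.strip() in ("SwapTotal", "SwapFree"):
            values[name.strip()] = int(rest.split()[0])
    return values["SwapTotal"] - values["SwapFree"]


def read_text(path):
    """Small helper to read one of the pseudo-files as a string."""
    with open(path) as handle:
        return handle.read()
```

On a Linux host, `active_thp_mode(read_text("/sys/kernel/mm/transparent_hugepage/enabled"))` should report `never` if THP is disabled, and a non-zero result from `swap_used_kb(read_text("/proc/meminfo"))` means the system is currently using swap.
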
  11. 11. Sizing Matters: 8 Buckets, 60 Design Docs, 10 XDCR Streams, 8 CPU Cores. See Perry Krug’s session at 5:15pm today!
  12. 12. Why Did My Operation Fail?
  13. 13. The lifecycle of a simple get() Operation [Diagram: Client (Your Application, Couchbase SDK), Network, Couchbase Server Node 1, Couchbase Server Node 2]
  14. 14. Possible Pitfalls: Operation couldn’t be dispatched [Diagram: Client (Your Application, Couchbase SDK)]
          • Client unsuccessfully initialised
          • Server Connections exhausted
          • Stop the World Garbage Collection
          • Consecutive Failovers
          Troubleshooting Tips!
          • Use a singleton pattern
          • Check node health on cluster
          • Check vbucket map from server
          • If necessary, restart client
          • Tune Garbage Collection Settings
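
The singleton tip above guards against exhausting server connections by opening a new client per request. A minimal sketch of the pattern in Python is shown here; `connect` is a hypothetical factory that would wrap whatever call your SDK uses to open a connection, and none of these names come from the Couchbase SDK itself.

```python
import threading

_lock = threading.Lock()
_client = None


def get_client(connect):
    """Return the one shared client, creating it on first use.

    `connect` is a caller-supplied, zero-argument factory (hypothetical).
    The lock ensures two threads cannot both create a client.
    """
    global _client
    if _client is None:
        with _lock:
            # Re-check inside the lock: another thread may have won the race.
            if _client is None:
                _client = connect()
    return _client
```

Every part of the application then calls `get_client(...)` instead of opening its own connection, so the number of server connections stays constant however many requests are in flight.
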
  15. 15. Possible Pitfalls: Operation Did Not Complete [Diagram: Client (Your Application, Couchbase SDK), Couchbase Server Node]
          • Did operation arrive at server?
          • Firewalls often intervene!
          • Did server have difficulty responding?
          • Did client receive a response?
          Troubleshooting Tips!
          • Look for a pattern as to which clients and servers are affected
          • Code defensively and retry (at least once)
          • Check network and server health
          • When all else fails, tcpdump / wireshark…
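
"Code defensively and retry" can be as small as a wrapper around each operation. The sketch below retries with exponential backoff; `TransientError` is a stand-in for whichever timeout or temporary-failure exception your SDK raises, and the attempt count and delays are assumed values rather than recommendations from the talk.

```python
import time


class TransientError(Exception):
    """Stand-in for an SDK timeout or temporary-failure exception (hypothetical)."""


def with_retry(operation, attempts=3, base_delay=0.05, sleep=time.sleep):
    """Call operation(), retrying on TransientError with exponential backoff.

    The final failure is re-raised so the caller still sees the error.
    `sleep` is injectable so the wrapper can be exercised without waiting.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

A call such as `with_retry(lambda: fetch(key))`, where `fetch` is your own function around the SDK's get, retries only on the transient error type, so genuine errors such as a missing key are reported immediately.
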
  16. 16. tcpdump and Wireshark
          • Wireshark is Couchbase aware
          • Uses tcpdump packet capture
          • Can be noisy so filter wisely
            • Specific client
            • Specific server
            • Port 11210
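
The three filter criteria above map directly onto a tcpdump capture filter. The following sketch builds the filter expression and a matching command line; the addresses in the usage note are documentation-range placeholders, and the interface and output file name are assumptions.

```python
def capture_filter(client=None, server=None, port=11210):
    """Build a capture filter limited to one client, one server and one port."""
    parts = []
    if client:
        parts.append(f"host {client}")
    if server:
        parts.append(f"host {server}")
    parts.append(f"port {port}")
    return " and ".join(parts)


def tcpdump_command(expression, interface="any", outfile="couchbase.pcap"):
    """Return an argument list that writes matching packets to a pcap file."""
    return ["tcpdump", "-i", interface, "-w", outfile, expression]
```

For example, `capture_filter("192.0.2.10", "192.0.2.20")` yields `host 192.0.2.10 and host 192.0.2.20 and port 11210`, and the resulting pcap file can then be opened in Wireshark.
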
  17. 17. Possible Pitfalls: Not The Response I Expected
          Customer: My operation failed with a “Key Does Not Exist” Error
          CB Support: OK, are you sure the key exists?
          Customer: Yes, our logs show the key being created and we don’t do deletes.
          CB Support: The delete counter on your cluster is increasing….
          Troubleshooting Tips!
          • Simplest explanation is often the correct one
          • Test your application code in node down, failover and rebalance scenarios
          • Defensive Coding for all Couchbase operations
            • E.g. for Temporary Out Of Memory
          • Have a simple client for quick tests to isolate the problem
            • < 10 lines of Python!
  18. 18. Some Other Handy Tools
  19. 19. Scenario: Troubleshooting a 2.2.0 XDCR case
          5 Million docs went swimming one day,
          Off to a cluster far away,
          XDCR said quack-quack-quack-quack,
          But only 4,999,999 docs came back
  20. 20. Finding the Needle in the Haystack
          • Identify a single vbucket that exhibits the problem
            • Immediately narrows the problem space to 1/1024th of the data set
            • cbstats vbucket-details
          • Interrogate the files on disk to find the discrepancy
            • couch_dbdump --no-body --json --by-id <file>
            • jq tool is very useful for CLI JSON parsing!
          • Diff the source cluster and the destination cluster
            • Easier on a static data set but still feasible on a live cluster
            • With ops in flight, perform the diff twice and take the intersection
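
The "diff twice and take the intersection" step can be sketched with plain sets. This sketch assumes the dump has been saved as text with one JSON object per line, each carrying an `id` field; that layout, the function names and the sample documents are assumptions made for illustration, so check them against your own `couch_dbdump` output.

```python
import json


def ids_from_dump(text):
    """Collect document IDs from a dump with one JSON object per line."""
    ids = set()
    for line in text.splitlines():
        line = line.strip()
        if line:
            ids.add(json.loads(line)["id"])
    return ids


def missing_ids(source_ids, dest_ids):
    """IDs present on the source cluster but absent from the destination."""
    return source_ids - dest_ids


def persistent_missing(first_pass, second_pass):
    """IDs missing in both passes.

    Documents that were merely in flight during the first diff arrive by the
    second, so only genuinely lost documents survive the intersection.
    """
    return first_pass & second_pass


# Hypothetical dumps of one vbucket, taken at two points in time.
src1 = '{"id": "a"}\n{"id": "b"}\n{"id": "c"}\n'
dst1 = '{"id": "a"}\n'
src2 = '{"id": "a"}\n{"id": "b"}\n{"id": "c"}\n{"id": "d"}\n'
dst2 = '{"id": "a"}\n{"id": "c"}\n'

first = missing_ids(ids_from_dump(src1), ids_from_dump(dst1))
second = missing_ids(ids_from_dump(src2), ids_from_dump(dst2))
lost = persistent_missing(first, second)
```

In this made-up data, document `c` was still in flight during the first pass and `d` during the second, so the intersection leaves only `b` as a candidate for the missing document.
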
  21. 21. Final Thoughts
          • You now know the common causes of Node and Operation Failures
          • Troubleshooting requires taking a logical path through the scenarios
          • Tools exist to help you isolate the problem
          • We are an open kitchen: issues.couchbase.com
          • Support team available for 24 x 7 x 365 emergency assistance….
            • Co-located with developers for fastest response time
          • … but we hope you’ll never need us!
  22. 22. Thank you.
