
Ship It!!! Coding Reliable Couchbase Applications for Production: Couchbase Connect 2015



Reliably delivering data to applications with high performance is where Couchbase shines, but maintaining a high-performance application is not just a job for Couchbase Server. Couchbase Server meets very stringent performance and availability needs, but to deliver data at scale, all application components need to work together as a single system. For example, you need to be prepared for edge conditions such as expected “TMPFAIL”s, failovers, and higher latencies under load. Fortunately, the Couchbase SDK gives you the tools you need. In this session, Michael and Matt show patterns for handling these scenarios, talk about some of the great failures from years of experience and how they can be prevented, and demonstrate techniques for making the entire system more reliable and able to recover faster.

Published in: Software


  1. SHIP IT!!! CODING RELIABLE COUCHBASE APPLICATIONS FOR PRODUCTION. Matt Ingenthron, Couchbase; Michael Nitschinger, Couchbase
  2. ©2015 Couchbase Inc. Warning: In this session you will hear stories of lost packets, corrupted data, confused administrators sending terabytes of logs to even more confused developers and many other insanely scary things. If the thought of a bit flip frightens you because you have only parity checking and no error correction, this session may not be for you. Computers were harmed while preparing this talk. If what you typically type after “catch” involves only the word “log”, this session may help you. If you hope to learn how an HTTP 503 can be useful, this presentation is for you.
  3. Game Show Time (war stories from the field)
  4. Obligatory Raising of Hands
     • Who here has used Couchbase?
     • Who has seen this?
  5. [Slide without text]
  6. Question One
     • System: Virtual machines at a public cloud provider. Node.js application.
     • Observation: Under load testing, saw high latencies (>100ms).
     • Causes?
       A) Bugs in Couchbase.
       B) The system software wasn’t well matched and tested.
       C) Running too many Node.js processes for the number of OS CPU cores.
       D) It’s the “cosmic rays”, man.
     • Root cause: The Ethernet device driver in the Linux distro didn’t work well with the virtualized hardware interface, causing high latencies.
     • Solution: Swap out the Linux OS distribution.
       • Went from one that was less common but had better user tooling to one of the most common ones in production deployments.
  7. Question Two
     • System: Private virtual machines on a private cloud. Strong monitoring and control of the environment.
     • Observation: As daily load would ramp up, latencies would rise and failures to meet the SLA would ensue.
     • Causes?
       A) Bugs in Couchbase.
       B) JVM garbage collection pauses.
       C) Virtualization is overprovisioned.
       D) The NSA wiretap program was slowing things down.
     • Root cause: Memory resources were overprovisioned on the private cloud.
     • Solution: Adjust the memory allocation within the environment.
       • Also found that the number of Tomcat workers was set unusually high: thousands of worker processes for systems with 8 virtual cores.
  8. Question Three
     • System: Database running on physical hardware, applications on VMs across the network. SLA need was 50ms or less.
     • Observation: Regular heartbeat of high latency in the 300-400ms range.
     • Causes?
       A) Bugs in Couchbase.
       B) Misconfigured load balancer sending all traffic to one app JVM.
       C) Monitoring system interrogating the kernel, causing lock contention.
       D) Standing waves from running a 50 Hz power supply under 60 Hz.
     • Root cause: The monitoring system was inspecting kernel counters on a regular basis and was somehow hitting a hot lock.
     • Solution: Disable that one poller in the monitor.
       • There were no other apps in that environment that had the same latency requirements, so it was assumed that the environment was clean.
  9. Planning for Success
  10. Define & Measure! (cycle diagram: Requirements, Develop, Test, Measure, Evaluate)
     If it's not defined you can't measure it.
     • SLAs
     • Throughput at max. latency
  11. Define & Measure!
     Ideally from the get-go:
     • Error Detection
     • Error Recovery
     • Error Mitigation
  12. Define & Measure!
     Not just unit testing.
     • Stress Tests
     • Load Tests
     • Failure Tests
  13. Define & Measure!
     You can't manage what you don't measure.
  14. Define & Measure!
     Evaluate, rinse, repeat.
  15. Service Level Required
     • 100% uptime is not easily achievable
     • For instance, is it 100% available if 50% of your users are leaving because it’s too slow?
     • The question must always be: “At max latency, what throughput do I get?”
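One way to turn the question on this slide into a number is to run the same workload at several offered rates and keep the highest rate whose tail latency still meets the SLA. The sketch below is only an illustration: the function names, the choice of the 99th percentile, and the sample latencies are all assumptions, not figures from the talk.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

def max_throughput_within_sla(runs, sla_ms, pct=99.0):
    """runs maps an offered rate (ops/sec) to the latencies (ms) observed at that rate.

    Returns the highest rate whose pct-th percentile latency is within sla_ms,
    or None if no run met the SLA.
    """
    ok = [rate for rate, latencies in runs.items()
          if percentile(latencies, pct) <= sla_ms]
    return max(ok) if ok else None

# Made-up measurements for illustration only.
runs = {
    1000: [2, 3, 3, 4, 5],
    5000: [3, 4, 6, 9, 12],
    10000: [5, 9, 30, 80, 140],
}
best = max_throughput_within_sla(runs, sla_ms=50)
print(best)
```

With these made-up samples the 10000 ops/sec run is excluded because its tail latency exceeds the 50 ms limit, even though its median looks fine.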
  16. Avoid the Coffin Corner (chart axes: Height, Speed)
  17. Avoid the Coffin Corner
     • Neither airplanes nor your applications like the extremes
     • Resource contention and overload conditions result in high latency
     • Keep some headroom to fly smoothly
  18. Prepare for bad weather
  19. with Error Detection
     • System Monitors
     • Periodic Checking
     • Watchdogs
     • Voting
     • Auditing
  20. with Error Recovery
     • Timeouts
     • Failover
     • Retries
  21. with Error Mitigation
     • Intelligent Data Structures
     • Failing Fast
     • Circuit Breakers
     • Backpressure
  22. Timeouts
     • Are your last resort when calling external resources.
     • So: always use them.
  23. Timeouts
  24. Timeouts
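As a rough illustration of "always use them", the sketch below bounds a call to an external resource with a deadline and falls back to a default when the deadline passes. It uses only the Python standard library; `call_with_timeout`, `slow_lookup`, `fast_lookup`, and the fallback value are hypothetical names for this example and are not part of any Couchbase SDK.

```python
import concurrent.futures
import time

# Shared worker pool for the bounded calls (size chosen arbitrarily).
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def call_with_timeout(fn, timeout_s, fallback):
    """Run fn on the pool and wait at most timeout_s seconds for its result."""
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # Best effort only: a call that is already running keeps its thread
        # until it finishes, so the remote call should carry its own timeout too.
        future.cancel()
        return fallback

def slow_lookup():
    time.sleep(1.0)  # stands in for an overloaded external resource
    return "fresh value"

def fast_lookup():
    return "fresh value"

slow_result = call_with_timeout(slow_lookup, 0.2, "cached value")
fast_result = call_with_timeout(fast_lookup, 0.2, "cached value")
print(slow_result, "/", fast_result)
```

The caller gets an answer within the deadline either way; what to return on timeout (a cached value, an error, a partial page) is an application decision.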
  25. Circuit Breakers
     • monitor traffic
     • open if errors happen
       • Latency
       • Throughput
       • Wrong results
     • close in a controlled fashion
     • expose metrics
  26. Circuit Breakers
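The bullets on the circuit breaker slide can be sketched as a small class. This is a minimal, single-threaded illustration, not a production implementation: it only counts raised exceptions as errors (the slide also lists latency, throughput, and wrong results), and all names and thresholds are assumptions. The clock is injectable so the example can run without waiting.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=5.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0       # consecutive failures seen so far
        self.opened_at = None   # None means the circuit is closed
        self.metrics = {"success": 0, "failure": 0, "rejected": 0}

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                self.metrics["rejected"] += 1
                raise CircuitOpenError("circuit open, failing fast")
            # Reset window elapsed: let one trial call through (half-open).
        try:
            result = fn()
        except Exception:
            self.metrics["failure"] += 1
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit in a controlled way: only after a good call.
        self.metrics["success"] += 1
        self.failures = 0
        self.opened_at = None
        return result

# Usage with a fake clock so no real time has to pass.
now = [0.0]
breaker = CircuitBreaker(max_failures=2, reset_after_s=10.0, clock=lambda: now[0])

def failing():
    raise OSError("backend down")

outcomes = []
for _ in range(3):
    try:
        breaker.call(failing)
    except CircuitOpenError:
        outcomes.append("rejected")
    except OSError:
        outcomes.append("failed")

now[0] = 11.0  # pretend the reset window has passed
recovered = breaker.call(lambda: "ok")
print(outcomes, recovered, breaker.metrics)
```

The `metrics` dictionary is the "expose metrics" bullet in its simplest form; a real service would export these counters to its monitoring system.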
  27. Backpressure
     • Allows for coordinated flow control under stress conditions
  28. Backpressure
     • Allows for coordinated flow control under stress conditions
     • Is used to shed load and provide a partially good experience
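A very small form of load shedding is to cap the number of requests in flight and reject the excess immediately instead of queueing it. The sketch below assumes an HTTP-style handler that answers 503 when saturated; `LoadShedder`, the status tuples, and the limit of one slot are illustrative choices, not something prescribed by the slides.

```python
import threading

class LoadShedder:
    """Caps concurrent work; excess requests are rejected rather than queued."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def handle(self, work):
        # Non-blocking acquire: if no slot is free, shed the request right away.
        if not self._slots.acquire(blocking=False):
            return 503, "busy, retry later"
        try:
            return 200, work()
        finally:
            self._slots.release()

shedder = LoadShedder(max_in_flight=1)

def outer_work():
    # While this request holds the only slot, a second request is shed.
    return shedder.handle(lambda: "inner")

status, body = shedder.handle(outer_work)
after = shedder.handle(lambda: "done")
print(status, body, after)
```

Rejecting early keeps latency bounded for the requests that are admitted, and the 503 gives well-behaved clients a clear signal to back off.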
  29. Testing & Benchmarking
  30. This is NOT a benchmark
  31. This is NOT a benchmark
  32. Benchmarking
     • Benchmarks assert expectations while tests verify correctness
     • As with statistics, almost always wrong and biased
     • Two hard problems in computer science:
       • Cache Invalidation
       • Naming Things
  33. Benchmarking
     • Benchmarks assert expectations while tests verify correctness
     • As with statistics, almost always wrong and biased
     • Three hard problems in computer science:
       • Cache Invalidation
       • Naming Things
       • Benchmarking
  34. Benchmarking
     • The appropriate workload
       • Concurrency
       • Think time
     • The right environment
       • Hardware, OS
       • External effects
     • The proper tool
       • Measure NOOPs
       • Be aware of GC, Coordinated Omission, ...
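Coordinated omission is easiest to see in a simulation. The sketch below models one client that intends to send a request every `interval_ms`; when one request stalls, the following requests are sent late. Measuring from the actual send time hides that wait, while measuring from the intended send time charges it to the request. The function name and the service times are made up for illustration.

```python
def measure(service_times_ms, interval_ms):
    """Simulate one client that intends to send a request every interval_ms.

    Returns (naive, corrected) latency lists in milliseconds.
    """
    naive, corrected = [], []
    free_at = 0  # time at which the client is able to send its next request
    for i, service in enumerate(service_times_ms):
        intended = i * interval_ms
        start = max(intended, free_at)    # a stalled client sends late
        end = start + service
        naive.append(end - start)         # ignores time spent waiting to send
        corrected.append(end - intended)  # charges the wait to the request
        free_at = end
    return naive, corrected

# One 100 ms stall among otherwise 1 ms responses (made-up numbers).
naive, corrected = measure([1, 100, 1, 1, 1], interval_ms=10)
print(naive)
print(corrected)
```

The naive series reports a single slow request, whereas the corrected series shows that the three requests behind the stall were also slow from the user's point of view.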
  35. And the industry?
     • Yahoo! Cloud Serving Benchmark (YCSB)
       • Industry standard
       • Makes it easy to compare solutions
       • Be aware of the (many) pitfalls!
     • Pioneering a new fork:
       • Maintained NoSQL versions
       • Coordinated Omission fixes
       • ...
  36. And the industry?
     • Java Microbenchmark Harness (JMH)
  37. Load & Stress Testing
     • Load Testing
       • Determine behaviour during normal traffic
     • Stress Testing
       • Traffic heavily increased (to the “Coffin Corner”)
       • Explicitly test edge cases
     • Knowing where and how it breaks is important
  38. Failure Testing
     • Test specific failure cases
       • Node failures
       • Netsplits
       • Firewall issues (dropped packets, closed sockets)
     • Failures will happen; better to prepare for them early.
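A failure test can be as simple as injecting faults into a stand-in for the data store and checking that the recovery code behaves as intended. The sketch below pairs a hypothetical `FlakyStore` with a hypothetical retry helper using exponential backoff; neither is a Couchbase API, and the sleep function is injectable so the test does not actually wait.

```python
import time

class FlakyStore:
    """Hypothetical in-memory stand-in that fails the first N calls."""

    def __init__(self, failures_before_success):
        self.remaining_failures = failures_before_success
        self.calls = 0

    def get(self, key):
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("injected failure")
        return "value-for-" + key

def get_with_retry(store, key, attempts=3, base_delay_s=0.01, sleep=time.sleep):
    """Retry store.get up to `attempts` times, doubling the delay each time."""
    for attempt in range(attempts):
        try:
            return store.get(key)
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay_s * (2 ** attempt))

delays = []  # records requested sleeps instead of sleeping

# Case 1: the store recovers within the retry budget.
recovering = FlakyStore(failures_before_success=2)
value = get_with_retry(recovering, "user::1", sleep=delays.append)

# Case 2: the store stays down, so the helper must give up.
dead = FlakyStore(failures_before_success=10)
try:
    get_with_retry(dead, "user::1", sleep=delays.append)
    gave_up = False
except ConnectionError:
    gave_up = True

print(value, recovering.calls, gave_up, dead.calls)
```

Both outcomes matter: the first case checks that transient failures are absorbed, the second that a persistent failure is reported after a bounded number of attempts rather than retried forever.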
  39. Some Tools to Consider
  40. Tools of the trade
     • Run tools to validate a setup with a reasonably known workload.
       • libcouchbase’s cbc pillowfight
       • Java’s RoadRunner
       • .NET’s MeepMeep
     • Isolate performance statistics at different layers.
       • libcouchbase and Java SDKs have performance profiling abilities
       • Couchbase has cbstats timings
  41. Questions?
  42. Thank you.