
Couchbase Server Scalability and Performance at LinkedIn: Couchbase Connect 2015


Jerry talks about lessons learned taking the second-highest-QPS Couchbase Server deployment at LinkedIn from meltdown to awesome. The story of Couchbase at LinkedIn is an insightful, suspense-filled account of step-by-step tuning of a system for performance. Jerry's team runs one of the fastest Couchbase clusters at LinkedIn (by QPS or latency) through clever use of SSDs and correct tuning of the key parameters: keeping the number of buckets small and choosing the right number of reader and writer threads. In this session, learn best practices for tuning at scale.



  1. Benjamin (Jerry) Franz, Sr. Site Reliability Engineer
  2. Scalability and Performance: Lessons learned taking the second-highest-QPS Couchbase server at LinkedIn from zero to awesome
  3. Day 1
  4. Meet the Couchbase Cluster • Three parallel clusters of 16 machines • 64 Gbytes of RAM per machine • 1 TB of RAID1 (spinning drives) per machine • 6 buckets in each cluster
  5. Meet the Couchbase Cluster • Three parallel clusters of 16 machines • 64 Gbytes of RAM per machine • 1 TB of RAID1 (spinning drives) per machine • 6 buckets in each cluster • Massively under-resourced • Memory completely full • 1 node failed out in each parallel cluster • Disk I/O utilization: 100%, all the time • No alerting
  7. The immediate problems • Unable to store new data because the memory was full and there wasn’t enough I/O capacity available to flush it to disk.
  8. The immediate problems • Unable to store new data because the memory was full and there wasn’t enough I/O capacity available to flush it to disk. • Aggravated by nodes failing out of the cluster - reducing both available memory and disk IOPS even further.
  9. The immediate problems • Unable to store new data because the memory was full and there wasn’t enough I/O capacity available to flush it to disk. • Aggravated by nodes failing out of the cluster - reducing both available memory and disk IOPS even further. • There was no visibility into cluster health because 1. We didn’t know what healthy metrics should look like for Couchbase. We didn’t even know which metrics were most important. 2. Alerts were not being sent even when the cluster was in deep trouble.
  10. The First Aid • Configured alerting.
  11. The First Aid • Configured alerting. • Started a temporary program of semi-manual monitoring and intervention to keep the cluster from falling over as it got too far behind. When it did get too far behind, we deleted all the data in the affected buckets and restarted.
  12. The First Aid • Configured alerting. • Started a temporary program of semi-manual monitoring and intervention to keep the cluster from falling over as it got too far behind. When it did get too far behind, we deleted all the data in the affected buckets and restarted. • Doubled the number of nodes (from 48 to 96) in the clusters to improve available memory and disk IOPS.
  13. The First Aid • Configured alerting. • Started a temporary program of semi-manual monitoring and intervention to keep the cluster from falling over as it got too far behind. When it did get too far behind, we deleted all the data in the affected buckets and restarted. • Doubled the number of nodes (from 48 to 96) in the clusters to improve available memory and disk IOPS. • Increased the disk fragmentation threshold for compaction from 30% to 65% to reduce disk I/O.
  14. The First Aid • Configured alerting. • Started a temporary program of semi-manual monitoring and intervention to keep the cluster from falling over as it got too far behind. When it did get too far behind, we deleted all the data in the affected buckets and restarted. • Doubled the number of nodes (from 48 to 96) in the clusters to improve available memory and disk IOPS. • Increased the disk fragmentation threshold for compaction from 30% to 65% to reduce disk I/O. • Reduced metadata expiration time from 3 days to 1 day to free memory.
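For reference, the last two First Aid changes correspond to Couchbase's auto-compaction settings: the database fragmentation threshold and (presumably) the tombstone/metadata purge interval, whose default is 3 days. A minimal sketch of applying them with couchbase-cli follows; the flag names reflect the documentation of that era, and the host, credentials, and values are placeholders to verify against your own release.

```python
import subprocess

# Hypothetical cluster endpoint and credentials -- replace with real values.
CLUSTER = "cb-node-01.example.com:8091"
AUTH = ["-u", "Administrator", "-p", "password"]

# Raise the database fragmentation threshold for auto-compaction from the
# default 30% to 65%, and shrink the metadata (tombstone) purge interval
# from 3 days to 1 day. Flag names follow the couchbase-cli docs of that
# era; verify them against your installed version.
subprocess.run(
    [
        "couchbase-cli", "setting-compaction",
        "-c", CLUSTER, *AUTH,
        "--compaction-db-percentage=65",
        "--metadata-purge-interval=1",
    ],
    check=True,
)
```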
  15. Node failouts – Solved • The node failouts had two interacting causes: 1. Linux Transparent HugePages were active on many nodes, causing semi-random slowdowns of nodes lasting up to several minutes while memory was being defragmented and making them fail out of the cluster. Fixed by correcting the kernel settings and restarting the nodes.
  16. Node failouts – Solved • The node failouts had two interacting causes: 1. Linux Transparent HugePages were active on many nodes, causing semi-random slowdowns of nodes lasting up to several minutes while memory was being defragmented and making them fail out of the cluster. Fixed by correcting the kernel settings and restarting the nodes. 2. ‘Pre-failure’ drives were going into data recovery mode and causing failouts on the affected nodes during the nightly access log scan at 10:00 UTC (02:00 PST/03:00 PDT).
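Disabling Transparent HugePages is a plain Linux change rather than a Couchbase setting. A small audit sketch like the one below can confirm every node really has THP off before it is restarted back into the cluster; the sysfs paths are the standard ones, and the boot-time fix mentioned in the comments is the usual approach rather than anything specific to this deck.

```python
from pathlib import Path

# Standard sysfs locations for Transparent HugePages on most distributions.
THP_FILES = [
    Path("/sys/kernel/mm/transparent_hugepage/enabled"),
    Path("/sys/kernel/mm/transparent_hugepage/defrag"),
]

def thp_disabled() -> bool:
    """Return True only if every THP knob present reports [never] selected."""
    ok = True
    for f in THP_FILES:
        if not f.exists():
            continue
        setting = f.read_text().strip()  # e.g. "always madvise [never]"
        selected = setting.split("[")[1].split("]")[0] if "[" in setting else setting
        print(f"{f}: {setting}")
        ok = ok and selected == "never"
    return ok

if __name__ == "__main__":
    # To actually disable THP, write "never" to the files above at boot
    # (or pass transparent_hugepage=never on the kernel command line),
    # then restart the Couchbase nodes as described in the slide.
    print("THP disabled" if thp_disabled() else "THP still active on this node")
```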
  17. Disk Persistence – Not solved • Despite more than doubling the available system resources, tuning filesystem options for performance, and slashing the amount of data being fed to it by the application, disk utilization remained stubbornly close to 100% and the disk queues were still growing.
  18. Disk Persistence – Not solved • Despite more than doubling the available system resources, tuning filesystem options for performance, and slashing the amount of data being fed to it by the application, disk utilization remained stubbornly close to 100% and the disk queues were still growing. • The cluster had a huge amount of ‘hidden I/O demand’. Because a large fraction of the data had very short TTLs but was taking up to a day to persist to disk, it was expiring in the queue before it could be persisted. That expiry had been doing quite a lot to keep the cluster from falling over completely, since it throttled disk demand as the cluster became overloaded. After the First Aid improvements we were now persisting twice as much data as before.
  20. Cluster Health Visibility – Solved • Alerts were being sent to the appropriate people - the cluster was no longer suffering outages without notice. • Critical cluster metrics were identified and were being used for health monitoring and to measure performance tuning improvements.
  21. Cluster Health Visibility – Solved The most important performance metrics • ep_diskqueue_items – The number of items waiting to be persisted. This should be a somewhat stable number day to day. If it has a persistently upward trend, the cluster is unable to keep up with its disk I/O requirements.
  22. Cluster Health Visibility – Solved The most important performance metrics • ep_diskqueue_items – The number of items waiting to be persisted. This should be a somewhat stable number day to day. If it has a persistently upward trend, the cluster is unable to keep up with its disk I/O requirements. • ep_storage_age – The age of the most recently persisted item. This has been a critical metric for quantifying the effects of configuration changes. A healthy cluster should keep this number close to or below 1 second on average. We started with values approaching days.
  23. Cluster Health Visibility – Solved The most important performance metrics • ep_diskqueue_items – The number of items waiting to be persisted. This should be a somewhat stable number day to day. If it has a persistently upward trend, the cluster is unable to keep up with its disk I/O requirements. • ep_storage_age – The age of the most recently persisted item. This has been a critical metric for quantifying the effects of configuration changes. A healthy cluster should keep this number close to or below 1 second on average. We started with values approaching days. • vb_active_perc_mem_resident – The percentage of items in the RAM cache. For most clusters at LinkedIn it should be 100%. If it falls below that, the cluster is probably underprovisioned and taking a big performance hit.
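A minimal sketch of watching these three stats with cbstats, the stats tool shipped with Couchbase Server. The node address, bucket names, and alert thresholds are illustrative assumptions rather than values from the talk; in practice ep_diskqueue_items should be judged by its trend, not a fixed limit.

```python
import subprocess

# Illustrative node and bucket names -- substitute your own.
NODE = "cb-node-01.example.com:11210"
BUCKETS = ["bucket_a", "bucket_b"]

# Illustrative thresholds, loosely based on the guidance above: keep
# ep_storage_age near or under ~1 second and the resident ratio at 100%.
THRESHOLDS = {
    "ep_diskqueue_items": 1_000_000,       # sustained growth is the real signal
    "ep_storage_age": 1,
    "vb_active_perc_mem_resident": 100,
}

def bucket_stats(node: str, bucket: str) -> dict:
    """Parse `cbstats <node> -b <bucket> all` into a {stat: value} dict."""
    out = subprocess.run(
        ["cbstats", node, "-b", bucket, "all"],
        capture_output=True, text=True, check=True,
    ).stdout
    stats = {}
    for line in out.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            stats[key.strip()] = value.strip()
    return stats

for bucket in BUCKETS:
    stats = bucket_stats(NODE, bucket)
    for stat, limit in THRESHOLDS.items():
        value = float(stats.get(stat, "0"))
        # The resident ratio alarms when it drops below its limit;
        # the other two alarm when they rise above theirs.
        bad = value < limit if stat == "vb_active_perc_mem_resident" else value > limit
        flag = "ALERT" if bad else "ok"
        print(f"{bucket:>12} {stat:<32} {value:>14.1f} [{flag}]")
```

Feeding the ALERT lines into whatever paging system is already in place covers the "configured alerting" step as well.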
  24. Overall Status Update
  25. Servers. Lots of servers. My best estimate was that to meet our I/O requirements we would have to at least double our total node count again, to 192 servers total (3 x 64). This was getting expensive.
  27. Servers. Lots of servers. My best estimate was that to meet our I/O requirements we would have to at least double our total node count again, to 192 servers total (3 x 64). This was getting expensive. It was time to change up my strategy.
  28. SSDs
  29. SSDs: Initial Integration Testing • 2 x 550 GB Virident SSDs were integrated into one of the sub-clusters
  30. SSDs: Initial Integration Testing • 2 x 550 GB Virident SSDs were integrated into one of the sub-clusters • Reduced cluster to 16 nodes to test under heavier load
  31. SSDs: Initial Integration Testing • 2 x 550 GB Virident SSDs were integrated into one of the sub-clusters • Reduced cluster to 16 nodes to test under heavier load • Write I/O on the SSDs shot up to multiple times the rate of the HDDs
  32. SSDs: Initial Integration Testing • 2 x 550 GB Virident SSDs were integrated into one of the sub-clusters • Reduced cluster to 16 nodes to test under heavier load • Write I/O on the SSDs shot up to multiple times the rate of the HDDs • Performance scaling indicated that in the final configuration we would burn through the SSD lifetime write capacity in less than one year
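None of the raw numbers behind that endurance projection are in the deck, so the sketch below uses made-up figures purely to show the shape of the arithmetic: rated lifetime writes divided by observed daily write volume.

```python
# All figures below are hypothetical placeholders, not measurements from
# the LinkedIn cluster; the point is only the shape of the calculation.
rated_endurance_tb = 900.0     # drive's rated lifetime writes (TBW), hypothetical
observed_write_mb_s = 60.0     # sustained writes per drive under load, hypothetical
write_amplification = 1.5      # filesystem + compaction overhead, hypothetical

daily_writes_tb = observed_write_mb_s * write_amplification * 86_400 / 1_000_000
projected_life_years = rated_endurance_tb / daily_writes_tb / 365

print(f"~{daily_writes_tb:.1f} TB written per drive per day")
print(f"projected drive life: ~{projected_life_years:.1f} years")
```

With these placeholder numbers the projection lands well under a year, which is the kind of result that drove the switch to larger drives on fewer nodes.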
  33. SSDs: SSD Strategy Tuning • Switched to 2200 GB Virident SSDs to extend service life • Reduced cluster size to 8 nodes per sub-cluster (24 nodes total)
  34. Full Scale SSD Impact
  35. Full Scale SSD Impact For the first time since the cluster was turned on nearly a year ago, almost all of our data was getting persisted to disk. So we converted the other two clusters as well.
  36. Full Scale SSD Impact For the first time since the cluster was turned on nearly a year ago, almost all of our data was getting persisted to disk. So we converted the other two clusters as well. Done?
  37. Full Scale SSD Impact For the first time since the cluster was turned on nearly a year ago, almost all of our data was getting persisted to disk. So we converted the other two clusters as well. Done? Not yet.
  38. Turning it up to 11 While we were no longer completely on fire, we weren’t yet awesome.
  39. Turning it up to 11 While we were no longer completely on fire, we weren’t yet awesome. We were still taking up to 40 minutes to persist new data.
  40. Turning it up to 11 While we were no longer completely on fire, we weren’t yet awesome. We were still taking up to 40 minutes to persist new data. It wasn’t the drives at this point – it was the application.
  41. Turning it up to 11 While we were no longer completely on fire, we weren’t yet awesome. We were still taking up to 40 minutes to persist new data. It wasn’t the drives at this point – it was the application. It wasn’t keeping up with the drives.
  43. Preparing Couchbase for Ludicrous Speed Increased the number of reader/writer threads to 8
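In Couchbase releases of that era the per-bucket reader/writer thread count was a bucket setting exposed through the REST API. The sketch below is only an illustration: the parameter name (threadsNumber), endpoint, credentials, and bucket name are assumptions to verify against your server version, some releases require re-sending other bucket settings (such as ramQuotaMB) in the same request, and changing it briefly restarts the bucket.

```python
import requests

# Hypothetical endpoint, credentials, and bucket name -- replace as needed.
BASE = "http://cb-node-01.example.com:8091"
AUTH = ("Administrator", "password")

# threadsNumber was the 2.x/3.x-era bucket parameter for the number of
# reader/writer threads (valid range 2-8); verify the name and any other
# required bucket fields for your release before using this.
resp = requests.post(
    f"{BASE}/pools/default/buckets/my_bucket",
    auth=AUTH,
    data={"threadsNumber": 8},
)
resp.raise_for_status()
```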
  44. Preparing Couchbase for Ludicrous Speed Consolidated the buckets (4 high-QPS buckets -> 2 high-QPS buckets)
  45. Preparing Couchbase for Ludicrous Speed Increased the frequency of disk cleanup (exp_pager_stime) to every 10 minutes
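exp_pager_stime is a per-node, per-bucket engine parameter, so it has to be pushed to every data node. A minimal sketch using cbepctl (shipped with Couchbase Server) follows; the node and bucket names are placeholders, and cbepctl changes of that era did not survive a node restart, so the same setting belongs in node provisioning as well.

```python
import subprocess

# Placeholder node and bucket names -- one cbepctl call per node per bucket.
NODES = ["cb-node-01.example.com:11210", "cb-node-02.example.com:11210"]
BUCKETS = ["bucket_a", "bucket_b"]

for node in NODES:
    for bucket in BUCKETS:
        # exp_pager_stime is the expiry pager's sleep time in seconds;
        # 600 runs the cleanup of expired items every 10 minutes.
        # Note: settings applied via cbepctl are not persistent across
        # node restarts, so also bake this into provisioning.
        subprocess.run(
            ["cbepctl", node, "-b", bucket,
             "set", "flush_param", "exp_pager_stime", "600"],
            check=True,
        )
```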
  46. And buckle your seatbelt
  47. And buckle your seatbelt • 75% writes (sets + incr) / 25% reads • 13-byte values, 25-byte keys on average • 2.5 billion items (+ 1 replica) • 600 Gbytes of RAM / 3 Tbytes of disk in use on average • Average store latency ~0.4 milliseconds • 99th percentile store latency ~2.5 milliseconds • Average get latency ~0.8 milliseconds • 99th percentile get latency ~8 milliseconds
  51. The End
