
Couchbase Connect 2016: Monitoring Production Deployments The Tools – LinkedIn

Good monitoring can be the difference between a great night's sleep and hearing your phone go off at 2:37 a.m. because of a production outage. Couchbase Server provides a large number of metrics, which can be overwhelming if you do not know which ones are critical or how to expose them to your monitoring system. In this talk we will look at example production incidents, going into depth on specific things to monitor and how this information can be used to find issues, work out root causes, and discover trends.

Published in: Technology

  1. ©2016 Couchbase Inc. Monitoring Production Deployments The Tools – LinkedIn • Alex Ma – Principal Architect – Couchbase • Michael Kehoe – Staff Site Reliability Engineer – LinkedIn
  2. Overview • Monitoring Tools • Making sense of the data • External Monitoring Integrations • Summary
  3. Alex Ma • Principal Architect, Strategic Accounts • alex@couchbase.com
  4. Michael Kehoe • Staff Site Reliability Engineer (SRE), LinkedIn • mkehoe@linkedin.com • Production-SRE team • Member of CBVT • Australian! • Contact • linkedin.com/in/michaelkkehoe • @matrixtek
  5. Monitoring Tools
  6. Monitoring Tools – Couchbase Web Console
  7. Monitoring Tools – Couchbase Web Console
  8. Monitoring Tools – Couchbase Web Console
  9. Monitoring Tools – Couchbase REST API • http://docs.couchbase.com/admin/admin/REST/rest-bucket-stats.html • GET /pools/default/buckets/[bucket-name]/stats • JSON output format • 60 collections (samples) per metric
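As a rough illustration of the REST endpoint on the slide above, the sketch below builds the stats URL and reduces each metric's list of samples to a last/average/max summary. It assumes the response nests the per-metric sample lists under `op` → `samples` (alongside a `timestamp` list) and that the admin port is 8091; the function names and the host name in the usage note are hypothetical.

```python
from statistics import mean


def bucket_stats_url(host, bucket, port=8091):
    """Build the bucket stats URL from the slide (port 8091 is an assumption)."""
    return f"http://{host}:{port}/pools/default/buckets/{bucket}/stats"


def summarize_samples(stats):
    """Reduce each metric's sample list to its last, average, and max value.

    `stats` is the already-parsed JSON body. The layout
    stats["op"]["samples"][metric] -> list of numbers is assumed here.
    """
    summary = {}
    for name, values in stats["op"]["samples"].items():
        if name == "timestamp" or not values:
            continue  # skip the time axis and any empty series
        summary[name] = {
            "last": values[-1],
            "avg": mean(values),
            "max": max(values),
        }
    return summary
```

A caller would fetch `bucket_stats_url("cb1.example.com", "my-bucket")` with any HTTP client and suitable credentials, parse the body as JSON, and pass the result to `summarize_samples`.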
  10. Monitoring Tools – cbstats • http://docs.couchbase.com/admin/admin/CLI/cbstats-intro.html • Command-line tool for viewing stats • 333+ available stats • Cumulative and snapshot
  11. Monitoring Tools – cbstats • Average value size = ep_value_size / (curr_items_tot - ep_num_non_resident) • ep_value_size = amount of RAM used to hold values in this bucket for this node • curr_items_tot = total count of active/replica items in this bucket for this node • ep_num_non_resident = total number of items not resident in RAM • 9567135872 / (28733039 - 26582747) ≈ 4449.23 bytes
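The average-value-size formula on this slide can be sketched as a small helper. The function name is hypothetical; the three arguments are the cbstats counters named on the slide, and the guard against a non-positive resident count is an addition for safety.

```python
def average_value_size(ep_value_size, curr_items_tot, ep_num_non_resident):
    """Average bytes of RAM per resident item, per the slide's formula."""
    resident_items = curr_items_tot - ep_num_non_resident
    if resident_items <= 0:
        raise ValueError("no resident items; average value size is undefined")
    return ep_value_size / resident_items


# The slide's own numbers: roughly 4.4 KB per resident value.
size = average_value_size(9567135872, 28733039, 26582747)
```

With the slide's counters, 2,150,292 items are resident, so the result is a little over 4,449 bytes per value.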
  12. Monitoring Tools – cbstats • cbstats can be pointed at a specific host and a specific port
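A minimal sketch of pointing cbstats at one node and turning its output into a dictionary. It assumes the data port is 11210, that the bucket is selected with `-b`, and that the output is one `name: value` pair per line; the helper names and the sample address are hypothetical, and authentication options are left out.

```python
import re

# One stat per line: optional indent, a name, a colon, padding, then the value.
_STAT_LINE = re.compile(r"^\s*(\S+):\s+(.*?)\s*$")


def cbstats_command(host, port=11210, group="all", bucket=None):
    """Build the argument list for a cbstats call against one host and port."""
    cmd = ["cbstats", f"{host}:{port}", group]
    if bucket is not None:
        cmd += ["-b", bucket]  # assumed bucket-selection flag
    return cmd


def parse_cbstats(output):
    """Parse 'name: value' lines into a dict, converting integers where possible."""
    stats = {}
    for line in output.splitlines():
        match = _STAT_LINE.match(line)
        if not match:
            continue  # skip blank lines and anything that is not a stat
        key, raw = match.groups()
        try:
            stats[key] = int(raw)
        except ValueError:
            stats[key] = raw  # keep non-numeric values (versions, paths) as text
    return stats
```

The command list can be passed to `subprocess.run(..., capture_output=True, text=True)` and the captured standard output handed to `parse_cbstats`, which is one way to "pipe" cbstats into a collector.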
  13. Monitoring Tools – cbstats • cbstats timings • Histogram that shows the timing of a number of internal operations • Commit to disk, background IO operations, GET ops • http://docs.couchbase.com/admin/admin/CLI/CBstats/cbstats-timing.html
  14. Monitoring Tools – Queries • http://developer.couchbase.com/documentation/server/current/tools/query-monitoring.html • http://localhost:8093/admin/vitals
  15. Monitoring Tools – htop • htop | top | vmstat | proc • Core utilization • Customization
  16. Monitoring Tools – iostat • IO utilization • Average wait times • Read/write requests • Determine capacity
  17. Monitoring Tools – iostat • IO utilization • Average wait times • Read/write requests • Determine capacity
  18. Monitoring Tools – iftop • See where traffic is coming from • Measure replication throughput • Verify capacity
  19. Making Sense of the Data
  20. Key Statistics • Metrics to consider: • Couchbase Server • Client application • Disk • Network
  21. Key Statistics – Couchbase Server
  22. Key Statistics – Couchbase Server • Metrics to consider: • Operations • Cache miss (ep_cache_miss_rate) • Active/replica vBuckets (vb_active_num/vb_replica_num) • Percentage of items in memory (vb_active_resident_items_ratio) • Disk queue (ep_diskqueue_items) • Misdirected requests (ep_num_not_my_vbuckets)
  23. Key Statistics – Couchbase Client • Metrics to consider: • Call-time latency • Measure GETs and SETs separately • Hit rate • Is the hit rate what you expected? • Errors • Timeouts retrieving objects • Unable to reach Couchbase Server • See http://developer.couchbase.com/documentation/server/4.0/sdks/java-2.2/event-bus-metrics.html
  24. Key Statistics – Couchbase Client
  25. Key Statistics – Disk • Metrics to consider: • Disk space • Compaction • Rebalance • Disk IO • Can the disk sustain the required IOPS? • Disk queue
  26. Key Statistics – Network • Metrics to consider: • Network connectivity • Connections • Capacity/utilization
  27. Key Statistics – Network – Connectivity • Ping: simple network connectivity test • Firewalls: make sure you have the correct ports open • See http://developer.couchbase.com/documentation/server/current/install/install-ports.html
  28. Key Statistics – Network – Connections • File-descriptor limits • Connections in CLOSE_WAIT state • Collect stats from /proc/net/tcp
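One way to collect the connection states mentioned on this slide is to count the `st` column of /proc/net/tcp, where each socket's state is a two-digit hex code (08 is CLOSE_WAIT). This is a minimal sketch that works on the file's text; the function name is hypothetical, and it covers IPv4 sockets only (IPv6 sockets are listed separately in /proc/net/tcp6).

```python
from collections import Counter

# Kernel TCP state codes as they appear in the "st" column.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state from the contents of /proc/net/tcp."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue
        state = fields[3].upper()
        counts[TCP_STATES.get(state, state)] += 1  # unknown codes keep their hex value
    return counts
```

On a Linux host, `count_tcp_states(open("/proc/net/tcp").read())["CLOSE_WAIT"]` gives the figure to export; a steadily growing count usually points at sockets the application has not closed.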
  29. Key Statistics – Network – Capacity/Utilization • Practical network capacity is ~85-90% of theoretical • E.g. 1Gb/s network interface can do 850-900Mb/s • Congested networks are problematic • Higher latency on responses • Slower replication • Collect stats from /proc/net/dev
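To relate /proc/net/dev to the 85-90% guideline on this slide, a collector can read the byte counters twice and divide the rate by the link speed. The sketch below assumes the usual layout (two header lines, then one line per interface with eight receive fields followed by eight transmit fields); the function names are hypothetical, and the link speed has to be supplied by the caller.

```python
def parse_net_dev(text):
    """Return {interface: (rx_bytes, tx_bytes)} from the contents of /proc/net/dev."""
    counters = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 16:
            continue
        # Field 0 is received bytes; field 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def utilization(prev, curr, interval_s, link_bits_per_s):
    """Fraction of link capacity used by the busier direction between two samples."""
    rx_bits = (curr[0] - prev[0]) * 8 / interval_s
    tx_bits = (curr[1] - prev[1]) * 8 / interval_s
    return max(rx_bits, tx_bits) / link_bits_per_s
```

Sampling once per interval and alerting when `utilization(...)` stays above about 0.85 matches the practical ceiling quoted on the slide; counter wraparound and interface resets are not handled here.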
  30. Key Statistics – Network – Capacity/Utilization • Practical network capacity is ~85-90% of theoretical (1250 Mb/s) • E.g. 1Gb/s network interface can do 850-900Mb/s
      Average object size (bytes): 4,096
      ID length (bytes): 32
      Metadata size (bytes): 56
      Reads: 100,000
      Writes: 60,000
      Replica count: 1
      Read network utilization: 421,600,000
      Write network utilization: 502,080,000
      Total network utilization: 923,680,000
      Theoretical max: 1.25 billion
      Remaining bandwidth: 276,320,000
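The slide gives only the inputs and totals of this estimate, not the formula. The sketch below uses per-operation sizes inferred from those numbers, so treat them as an assumption: each read is counted as value + metadata + the key twice (request and response), and each write as value + key + metadata, multiplied by one plus the replica count. These choices reproduce the slide's read, write, and total figures; the function name is hypothetical.

```python
def network_budget(value_size, key_size, meta_size, reads, writes, replicas):
    """Estimate read, write, and total network bytes for the given workload.

    The per-operation sizes are inferred from the slide's totals, not
    taken from a documented formula.
    """
    read_bytes = (value_size + 2 * key_size + meta_size) * reads
    write_bytes = (value_size + key_size + meta_size) * writes * (1 + replicas)
    return read_bytes, write_bytes, read_bytes + write_bytes


read_b, write_b, total_b = network_budget(4096, 32, 56, 100_000, 60_000, 1)
# read_b == 421_600_000, write_b == 502_080_000, total_b == 923_680_000
```

Headroom is then the chosen capacity minus the total. Note that the slide's remaining figure of 276,320,000 equals 1.2 billion minus the total, whereas 1.25 billion minus the total is 326,320,000, so it is worth making the capacity baseline explicit when reusing this estimate.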
  31. External Monitoring Integrations
  32. External Monitoring Integrations
  33. External Monitoring Integrations – Write your own • Getting started • Use Couchbase REST API • Pipe 'cbstats' output
  34. Using Couchbase REST API • Examples • Datadog – http://lnkd.in/cb-datadog • This example – http://lnkd.in/cb-stats-collector
  35. Using Couchbase REST API
  36. Using Couchbase REST API
  37. Using Couchbase REST API
  38. Using Couchbase cbstats
  39. Using Couchbase cbstats
  40. Summary
  41. Summary • Important to have monitoring in place • Understand the metrics you monitor • What causes them • How to remediate
  42. Thank You!
