
Keynote: Scaling Sensu Go


For over eight years, the Sensu community has been using Sensu to monitor their applications and infrastructure at scale. Sensu Go became generally available at the beginning of this year, and was designed to be more portable, easier and faster to deploy, and most importantly: more scalable than ever before! In this talk, Sensu CTO Sean Porter will share Sensu Go scaling patterns, best practices, and case studies. He’ll also explain our design and architectural choices and talk about our plan to take things even further.



  1. 1. Scaling Sensu Go By Sean Porter, Co-founder & CTO.
  2. 2. Who am I? ● Creator of Sensu ● Co-founder ● CTO ● PorterTech 2
  3. 3. Overview 1. How we 10X’d performance in 6 months 2. Deployment architectures 3. Hardware recommendations 4. Summary 5. Questions 3
  4. 4. Goals for Sensu Go
  5. 5. 5
  6. 6. 6
  7. 7. Scale, in terms of: ● Performance ● Organization
  8. 8. GA: December 5th, 2018
  9. 9. Scaling Sensu Core (1.X) ● Steep learning curve ● Requires RabbitMQ and Redis expertise ● Capable of scaling*
  10. 10. Scaling Sensu Core (1.X) 10
  11. 11. Scaling Sensu Core (1.X) 11
  12. 12. 12
  13. 13. Step 1 - Instrument 13
  14. 14. Step 2 - Test environment ● Used AWS EC2 ● M5.2xlarge to i3.metal ● Agent session load tool ● Disappointing results (~5k) ● Inconsistent
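
To give a concrete picture of what an agent session load tool does, here is a minimal sketch that opens many long-lived WebSocket sessions against a backend and holds them open. The URL, flags, and handshake are assumptions for illustration; the actual tool lives in the sensu-perf repository referenced later in this deck.

```go
// loadtool.go: a sketch of an agent-session load tool. It dials N WebSocket
// connections to a backend and keeps them open so you can observe how many
// concurrent sessions the backend sustains. The backend URL and the absence
// of a real agent handshake are simplifying assumptions, not Sensu's protocol.
package main

import (
	"flag"
	"log"
	"sync"

	"github.com/gorilla/websocket"
)

func main() {
	backend := flag.String("backend", "ws://127.0.0.1:8081/", "backend WebSocket URL (assumed)")
	sessions := flag.Int("sessions", 5000, "number of concurrent agent-style sessions")
	flag.Parse()

	var wg sync.WaitGroup
	for i := 0; i < *sessions; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			conn, _, err := websocket.DefaultDialer.Dial(*backend, nil)
			if err != nil {
				log.Printf("session %d: dial failed: %v", id, err)
				return
			}
			defer conn.Close()
			// Block on reads to hold the session open; a real load tool
			// would also send keepalives and check results on an interval.
			for {
				if _, _, err := conn.ReadMessage(); err != nil {
					log.Printf("session %d: closed: %v", id, err)
					return
				}
			}
		}(i)
	}
	wg.Wait()
}
```
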
  15. 15. Step 3 - Get serious 15
  16. 16. Spent $10k on gaming hardware.
  17. 17. 17
  18. 18. Why bear bare metal? ● Control ● Consistency ● Capacity
  19. 19. 19
  20. 20. Backend hardware ● AMD Threadripper 2920X (12 Cores, 3.5GHz) ● Gigabyte X399 AORUS PRO ● 16GB DDR4 2666MHz CL16 (2x 8GB) ● Two Intel 660p Series M.2 PCIe 512GB SSDs ● Intel Gigabit CT PCIe Network Card
  21. 21. Agents hardware ● AMD Threadripper 2990WX (32 Cores, 3.0GHz) ● Gigabyte X399 AORUS PRO ● 32GB DDR4 2666MHz CL16 (4x 8GB) ● Intel 660p Series M.2 PCIe 512GB SSD
  22. 22. Network hardware ● Two Ubiquiti UniFi 8 Port 60W Switches ● Separate load tool and data planes
  23. 23. 23
  24. 24. The first results ● Consistently delivered disappointing results! Agents: 4,000; Checks: 8 at 5s interval; Events/s: 6,400 ● Produced data!
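
The expected rate follows directly from these numbers: each agent runs every check once per interval. A quick back-of-the-envelope check (a sketch, not Sensu code):

```go
// Expected event throughput: every agent runs every check once per interval,
// so events/s = agents * checks / interval.
package main

import "fmt"

func main() {
	agents, checks, intervalSeconds := 4000.0, 8.0, 5.0
	fmt.Println(agents * checks / intervalSeconds) // 6400 events/s
}
```
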
  25. 25. The first results ● Identified several possible bottlenecks ● Identified bugs while under load! ● Began experimentation...
  26. 26. The primary offender ● Sensu Events! ● ~95% of etcd write operations ● Disabled Event persistence - 11,200 Events/s ● etcd max database size (10GB*) ● Needed to move the workload
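
To illustrate why Events were the primary offender: every processed check result became a write against the key-value store, so at thousands of Events per second the event keyspace dominates etcd traffic. A minimal sketch of that per-event write using the etcd clientv3 client; the key layout and payload here are assumptions, not Sensu's actual schema:

```go
// Per-event PUTs against etcd. Multiplied by thousands of events per second,
// this write pattern is what was later moved to PostgreSQL.
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// storeEvent writes one event; the key layout is illustrative only.
func storeEvent(ctx context.Context, cli *clientv3.Client, entity, check string, payload []byte) error {
	key := "/sensu.io/events/default/" + entity + "/" + check
	_, err := cli.Put(ctx, key, string(payload))
	return err
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	if err := storeEvent(context.Background(), cli, "agent-1", "check-cpu", []byte(`{"status":0}`)); err != nil {
		log.Fatal(err)
	}
}
```
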
  27. 27. 27
  28. 28. 28
  29. 29. PostgreSQL hardware ● AMD Threadripper 2920X (12 Cores, 3.5GHz) ● Gigabyte X399 AORUS PRO ● 16GB DDR4 2666MHz CL16 (2x 8GB) ● Two Intel 660p Series M.2 PCIe 512GB SSDs ● Three Intel Gigabit CT PCIe Network Cards
  30. 30. New results with PostgreSQL - Agents: 4,000; Checks: 14 at 5s interval; Events/s: 11,200. Not good enough!
  31. 31. PostgreSQL tuning ● Multi-Version Concurrency Control ● Many updates - need aggressive auto-vacuuming! vacuum_cost_delay = 10ms; vacuum_cost_limit = 10000; autovacuum_naptime = 10s; autovacuum_vacuum_scale_factor = 0.05; autovacuum_analyze_scale_factor = 0.025
  32. 32. PostgreSQL tuning ● Tune write-ahead logging ● Reduce the number of disk writes wal_sync_method = fdatasync; wal_writer_delay = 5000ms; max_wal_size = 5GB; min_wal_size = 1GB
  33. 33. A huge bug! ● Burying Check TTL switch set on every Event! ● Additional etcd PUT and DELETE operations
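
The fix amounts to only touching the TTL switch when a check actually defines a TTL, rather than on every Event. A hedged sketch of that guard; the types and names are illustrative, not Sensu's real code:

```go
package main

import "fmt"

// Check and Event mirror just the fields needed for this sketch; they are
// illustrative, not Sensu's real types.
type Check struct {
	Name string
	TTL  int64 // seconds; 0 means no TTL configured
}

type Event struct {
	Entity string
	Check  Check
}

// bumpTTLSwitch stands in for the store-backed "switch" that tracks whether
// results keep arriving before the TTL expires.
func bumpTTLSwitch(key string, ttl int64) {
	fmt.Printf("PUT switch %s (ttl=%ds)\n", key, ttl)
}

// handleEvent only touches the TTL switch when the check defines a TTL.
// The bug updated (and later deleted) a switch for every event, adding an
// extra etcd PUT and DELETE per result even when no TTL was set.
func handleEvent(e Event) {
	if e.Check.TTL <= 0 {
		return // nothing to do for checks without a TTL
	}
	bumpTTLSwitch(e.Entity+"/"+e.Check.Name, e.Check.TTL)
}

func main() {
	handleEvent(Event{Entity: "agent-1", Check: Check{Name: "check-cpu", TTL: 0}})  // skipped
	handleEvent(Event{Entity: "agent-1", Check: Check{Name: "check-mem", TTL: 90}}) // switch updated
}
```
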
  34. 34. New results with bug fix - Agents: 4,000; Checks: 40 at 5s interval; Events/s: 32,000. Much better! Still not good enough.
  35. 35. Entity and silenced caches ● Several etcd range (read) requests per Event ● Caching reduced etcd range requests by 50% ● No improvement to Event throughput :(
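
Caching here means answering repeated entity and silenced-entry lookups from process memory, so each Event no longer triggers its own etcd range request. A minimal sketch of a read-through cache with a staleness window; illustrative only, not Sensu's implementation:

```go
// Package cache sketches a read-through cache: lookups are served from memory
// when fresh, and only fall back to the backing store (etcd range requests,
// in the backend's case) on a miss or once an entry goes stale.
package cache

import (
	"sync"
	"time"
)

type entry struct {
	value    any
	cachedAt time.Time
}

// ReadThrough wraps a backing-store lookup with an in-memory cache.
type ReadThrough struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]entry
	fetch   func(key string) (any, error) // backing store lookup
}

func New(ttl time.Duration, fetch func(string) (any, error)) *ReadThrough {
	return &ReadThrough{ttl: ttl, entries: map[string]entry{}, fetch: fetch}
}

// Get returns a cached value while it is still fresh, otherwise fetches and
// caches it. Every cache hit is one fewer range request against the store.
func (c *ReadThrough) Get(key string) (any, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.entries[key]; ok && time.Since(e.cachedAt) < c.ttl {
		return e.value, nil
	}
	v, err := c.fetch(key)
	if err != nil {
		return nil, err
	}
	c.entries[key] = entry{value: v, cachedAt: time.Now()}
	return v, nil
}
```
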
  36. 36. Serialization ● Every object is serialized for transport and storage ● Changed from JSON to Protobuf ○ Applied to Agent transport and etcd store ○ Reduced serialized object size! ○ Less CPU time
  37. 37. Internal queues and workers ● Increased Backend internal queue lengths ○ From 100 to 1000 (made configurable) ● Increased Backend internal worker counts ○ From 100 to 1000 (made configurable) ● Increases concurrency and absorbs latency spikes
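
These internal queues are essentially bounded buffers feeding a pool of worker goroutines: a longer buffer absorbs short latency spikes without blocking producers, and more workers raise concurrency. A minimal sketch of the pattern; the sizes and names are illustrative, not Sensu's actual code:

```go
package main

import (
	"fmt"
	"sync"
)

// startWorkers drains a bounded queue with a fixed pool of goroutines.
func startWorkers(queue <-chan string, workers int, handle func(string)) *sync.WaitGroup {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for item := range queue {
				handle(item)
			}
		}()
	}
	return &wg
}

func main() {
	// Queue length and worker count were raised from 100 to 1000 (and made
	// configurable) in the backend; the values here are just placeholders.
	queue := make(chan string, 1000)
	wg := startWorkers(queue, 1000, func(event string) {
		fmt.Println("processed", event)
	})

	for i := 0; i < 5; i++ {
		queue <- fmt.Sprintf("event-%d", i)
	}
	close(queue)
	wg.Wait()
}
```
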
  38. 38. New results - Agents: 36,000; Checks: 38 at 10s interval (4 subscriptions); Events/s: 34,200. Almost there!!!
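
The arithmetic behind this figure, assuming the 38 checks are spread evenly across the four subscriptions so each agent runs roughly a quarter of them:

```go
// events/s = agents * checks / subscriptions / interval, assuming an even
// split of checks across subscriptions (an assumption, not stated on the slide).
package main

import "fmt"

func main() {
	agents, checks, subscriptions, interval := 36000.0, 38.0, 4.0, 10.0
	fmt.Println(agents * checks / subscriptions / interval) // 34200 events/s
}
```
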
  39. 39. 39
  40. 40. New results - Agents: 40,000; Checks: 38 at 10s interval (4 subscriptions); Events/s: 38,000.
  41. 41. 41
  42. 42. The performance project ● https://github.com/sensu/sensu-perf ● Performance tests are reproducible ● Users can test their own deployments! ● Now part of release QA!
  43. 43. What’s next for scaling Sensu?
  44. 44. Multi-site Federation ● 40,000 Agents per cluster ● Run multiple/distributed Sensu Go clusters ● Centralized RBAC policy management ● Centralized visibility via the WebUI 44
  45. 45. Deployment architectures
  46. 46. 46
  47. 47. 47
  48. 48. 48
  49. 49. 49
  50. 50. 50
  51. 51. 51
  52. 52. 52
  53. 53. Hardware recommendations*
  54. 54. Backend requirements ● 16 vCPU ● 16GB memory ● Attached NVMe SSD ○ >50MB/s and >5k sustained random IOPS ● Gigabit ethernet (low latency)
  55. 55. PostgreSQL requirements ● 16 vCPU ● 16GB memory ● Attached NVMe SSD ○ >300MB/s and >5k sustained random IOPS ● 10 gigabit ethernet (low latency)
  56. 56. Summary
  57. 57. 57
  58. 58. Questions?
