For over eight years, the Sensu community has been using Sensu to monitor their applications and infrastructure at scale. Sensu Go became generally available at the beginning of this year and was designed to be more portable, easier and faster to deploy, and, most importantly, more scalable than ever before! In this talk, Sensu CTO Sean Porter will share Sensu Go scaling patterns, best practices, and case studies. He’ll also explain our design and architectural choices and talk about our plan to take things even further.
29. PostgreSQL hardware
● AMD Threadripper 2920X (12 Cores, 3.5GHz)
● Gigabyte X399 AORUS PRO
● 16GB DDR4 2666MHz CL16 (2x 8GB)
● Two Intel 660p Series M.2 PCIe 512GB SSDs
● Three Intel Gigabit CT PCIe Network Cards
30. New results with PostgreSQL
Agents: 4,000
Checks: 14 at 5s interval
Events/s: 11,200
Not good enough!
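The event rate on this slide follows from the agent count, the number of checks, and the check interval. A minimal sketch of that arithmetic, assuming every agent runs every check once per interval (the function name `events_per_second` is illustrative and not part of Sensu):

```python
def events_per_second(agents: int, checks: int, interval_s: float) -> float:
    """Events produced per second if every agent runs every check once per interval."""
    return agents * checks / interval_s


rate = events_per_second(agents=4_000, checks=14, interval_s=5)

# Average time available to process one event at that rate, in microseconds.
budget_us = 1_000_000 / rate

print(f"{rate:,.0f} events/s, about {budget_us:.0f} microseconds per event")
```

At 11,200 events/s, the backend and its datastore have roughly 89 microseconds per event on average, which is why per-event write costs dominate the tuning that follows.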
31. PostgreSQL tuning
● Multi-Version Concurrency Control
● Many updates - need aggressive auto-vacuuming!
vacuum_cost_delay = 10ms
vacuum_cost_limit = 10000
autovacuum_naptime = 10s
autovacuum_vacuum_scale_factor = 0.05
autovacuum_analyze_scale_factor = 0.025
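To see what the lower scale factor changes, the PostgreSQL documentation gives the autovacuum trigger as a base threshold plus the scale factor times the number of rows in the table. The sketch below applies that formula; the table size is an assumption (one row per agent/check pair, using the 4,000 agents and 14 checks from the previous slide), and `autovacuum_trigger` is an illustrative name.

```python
def autovacuum_trigger(live_rows: int, scale_factor: float, base_threshold: int = 50) -> float:
    """Dead rows needed before autovacuum processes a table.

    Formula from the PostgreSQL documentation:
    autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * rows.
    The default base threshold is 50.
    """
    return base_threshold + scale_factor * live_rows


# Assumption: one row per agent/check pair, i.e. 4,000 agents x 14 checks.
rows = 4_000 * 14

default_trigger = autovacuum_trigger(rows, scale_factor=0.2)   # PostgreSQL default
tuned_trigger = autovacuum_trigger(rows, scale_factor=0.05)    # value from the slide

print(f"default: vacuum after ~{default_trigger:,.0f} dead rows")
print(f"tuned:   vacuum after ~{tuned_trigger:,.0f} dead rows")
```

Under MVCC every update leaves a dead row version behind, so a smaller scale factor lets autovacuum start after far fewer updates. Lowering `autovacuum_naptime` from its 1 minute default to 10s also makes the launcher check for work more often.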
32. PostgreSQL tuning
● Tune write-ahead logging
● Reduce the number of disk writes
wal_sync_method = fdatasync
wal_writer_delay = 5000ms
max_wal_size = 5GB
min_wal_size = 1GB
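One way to apply settings like these without editing `postgresql.conf` by hand is PostgreSQL's `ALTER SYSTEM` command followed by a configuration reload. The sketch below only renders the SQL text and does not connect to a database; the helper name `alter_system_statements` is hypothetical. Whether a given parameter takes effect on reload or needs a restart depends on its context, which can be checked in the `pg_settings` view.

```python
# WAL settings exactly as listed on the slide.
WAL_SETTINGS = {
    "wal_sync_method": "fdatasync",
    "wal_writer_delay": "5000ms",
    "max_wal_size": "5GB",
    "min_wal_size": "1GB",
}


def alter_system_statements(settings: dict[str, str]) -> list[str]:
    """Render one ALTER SYSTEM statement per setting, then a config reload."""
    statements = [
        f"ALTER SYSTEM SET {name} = '{value}';" for name, value in settings.items()
    ]
    statements.append("SELECT pg_reload_conf();")
    return statements


for statement in alter_system_statements(WAL_SETTINGS):
    print(statement)
```

The generated statements can be pasted into `psql` as a superuser.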
33. A huge bug!
● Burying Check TTL switch set on every Event!
● Additional etcd PUT and DELETE operations
34. New results with bug fix
Agents: 4,000
Checks: 40 at 5s interval
Events/s: 32,000
Much better! Still not good enough.
35. Entity and silenced caches
● Several etcd range (reads) requests per Event
● Caching reduced etcd range requests by 50%
● No improvement to Event throughput :(
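The idea behind such a cache can be sketched as a small read-through cache with a time-to-live in front of the store. This is not Sensu's implementation (Sensu Go is written in Go); `ReadThroughCache` and `fetch_entity` are hypothetical names, and `fetch_entity` only counts calls in place of a real etcd range request.

```python
import time
from typing import Callable, Dict, Tuple


class ReadThroughCache:
    """Tiny TTL cache placed in front of a slower key/value store lookup."""

    def __init__(self, fetch: Callable[[str], str], ttl_s: float) -> None:
        self._fetch = fetch
        self._ttl_s = ttl_s
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> str:
        now = time.monotonic()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl_s:
            return entry[1]            # fresh hit: no store request
        value = self._fetch(key)       # miss or expired: one store request
        self._entries[key] = (now, value)
        return value


store_reads = 0


def fetch_entity(name: str) -> str:
    """Stand-in for an etcd range request; only counts how often it is called."""
    global store_reads
    store_reads += 1
    return f"entity:{name}"


cache = ReadThroughCache(fetch_entity, ttl_s=60.0)

# 1,000 events from 10 agents: without a cache this would be 1,000 store reads.
for i in range(1_000):
    cache.get(f"agent-{i % 10}")

print(store_reads)
```

Only the first lookup per agent reaches the store. As the slide notes, fewer reads did not raise Event throughput, which suggests reads were not the bottleneck in this case.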
36. Serialization
● Every object is serialized for transport and storage
● Changed from JSON to Protobuf
○ Applied to Agent transport and etcd store
○ Reduced serialized object size!
○ Less CPU time
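The size difference between a text and a binary encoding can be illustrated with the standard library alone. The sketch below uses Python's `struct` module as a stand-in for a binary format; it is not Protobuf, whose wire format uses numbered field tags and variable-length integers, and the event shown is a hypothetical, heavily simplified one.

```python
import json
import struct

# Hypothetical, simplified event; real Sensu events carry many more fields.
event = {
    "timestamp": 1_560_000_000,
    "status": 0,
    "interval": 5,
    "entity": "agent-0001",
    "check": "cpu",
}

# Text encoding: field names and digits are spelled out in every message.
as_json = json.dumps(event, separators=(",", ":")).encode("utf-8")


def pack_str(text: str) -> bytes:
    """Length-prefixed UTF-8 string with a 1-byte length."""
    raw = text.encode("utf-8")
    return struct.pack("B", len(raw)) + raw


# Binary encoding: fixed-width integers and no field names on the wire.
as_binary = (
    struct.pack("<QBH", event["timestamp"], event["status"], event["interval"])
    + pack_str(event["entity"])
    + pack_str(event["check"])
)

print(len(as_json), len(as_binary))
```

The binary form omits the repeated field names, which is the main reason it is smaller. Protobuf gets a similar effect by sending field numbers instead of names.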
37. Internal queues and workers
● Increased Backend internal queue lengths
○ From 100 to 1000 (made configurable)
● Increased Backend internal worker counts
○ From 100 to 1000 (made configurable)
● Increases concurrency and absorbs latency spikes
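Why a longer queue absorbs spikes can be shown with a bounded queue and a worker pool. This is a sketch of the general pattern and not Sensu's Go implementation; `run_burst` is a hypothetical helper, and rejecting events when the queue is full is an assumption made to keep the result deterministic.

```python
import queue
import threading


def run_burst(queue_length: int, workers: int, burst: int) -> tuple[int, int]:
    """Offer `burst` events to a bounded queue, then drain it with a worker pool.

    Returns (accepted, rejected). Events that do not fit are rejected here;
    a real backend might block or apply backpressure instead.
    """
    events: queue.Queue = queue.Queue(maxsize=queue_length)
    accepted = rejected = 0

    # The burst arrives before the workers get a chance to run.
    for event_id in range(burst):
        try:
            events.put_nowait(event_id)
            accepted += 1
        except queue.Full:
            rejected += 1

    def worker() -> None:
        while True:
            try:
                events.get_nowait()
            except queue.Empty:
                return  # queue drained, worker exits
            events.task_done()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return accepted, rejected


print(run_burst(queue_length=100, workers=100, burst=1_000))
print(run_burst(queue_length=1_000, workers=1_000, burst=1_000))
```

With a queue length of 100, only the first 100 events of a 1,000-event burst fit. With a length of 1,000, the whole burst is buffered until the workers catch up.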
38. New results
Agents: 36,000
Checks: 38 at 10s interval (4 subscriptions)
Events/s: 34,200
Almost there!!!
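The 34,200 events/s figure is lower than 36,000 × 38 / 10 = 136,800, so not every agent runs every check. One reading that reproduces the slide's number, assuming the agents are spread evenly across the 4 subscriptions and each check targets exactly one subscription (the slide does not spell this out):

```python
def events_per_second(agents_per_check: int, checks: int, interval_s: float) -> float:
    """Events per second when each check runs on `agents_per_check` agents."""
    return agents_per_check * checks / interval_s


agents, subscriptions = 36_000, 4

# Assumption: even split, so each check reaches a quarter of the agents.
agents_per_check = agents // subscriptions

rate = events_per_second(agents_per_check, checks=38, interval_s=10)
print(f"{rate:,.0f} events/s")
```

Under that assumption each check runs on 9,000 agents, and 9,000 × 38 / 10 gives the 34,200 events/s shown.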
44. Multi-site Federation
● 40,000 Agents per cluster
● Run multiple/distributed Sensu Go clusters
● Centralized RBAC policy management
● Centralized visibility via the WebUI