Scaling Sensu Go
By Sean Porter,
Co-founder & CTO.
Who am I?
● Creator of Sensu
● Co-founder
● CTO
● PorterTech
Overview
1. How we 10X’d performance in 6 months
2. Deployment architectures
3. Hardware recommendations
4. Summary
5. Questions
Goals for Sensu Go
Scale
In terms of:
● Performance
● Organization
GA: December 5th, 2018
Scaling Sensu Core (1.X)
● Steep learning curve
● Requires RabbitMQ and Redis expertise
● Capable of scaling*
Step 1 - Instrument
Step 2 - Test environment
● Used AWS EC2
● Instance types from m5.2xlarge to i3.metal
● Agent session load tool
● Disappointing results (~5k)
● Inconsistent
Step 3 - Get serious
Spent $10k on gaming hardware.
Why bare metal?
● Control
● Consistency
● Capacity
Backend hardware
● AMD Threadripper 2920X (12 Cores, 3.5GHz)
● Gigabyte X399 AORUS PRO
● 16GB DDR4 2666MHz CL16 (2x 8GB)
● Two Intel 660p Series M.2 PCIe 512GB SSDs
● Intel Gigabit CT PCIe Network Card
Agents hardware
● AMD Threadripper 2990WX (32 Cores, 3.0GHz)
● Gigabyte X399 AORUS PRO
● 32GB DDR4 2666MHz CL16 (4x 8GB)
● Intel 660p Series M.2 PCIe 512GB SSD
Network hardware
● Two Ubiquiti UniFi 8 Port 60W Switches
● Separate load tool and data planes
The first results
● Consistently delivered disappointing results!
Agents: 4,000
Checks: 8 at 5s interval
Events/s: 6,400
● Produced data!
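The Events/s figure on this slide is consistent with simple arithmetic: agents × checks ÷ interval. The sketch below reproduces it. The `subscriptions` parameter is an assumption added here to match the later results in this deck that mention 4 subscriptions: it presumes the checks are split evenly across subscriptions and that each agent belongs to exactly one of them. The function name is illustrative and is not part of Sensu.

```python
def events_per_second(agents: int, checks: int, interval_s: float,
                      subscriptions: int = 1) -> float:
    """Expected check events per second for a synthetic load.

    Assumption (not stated in the slides): checks are divided evenly
    across subscriptions and every agent holds exactly one subscription.
    """
    checks_per_agent = checks / subscriptions
    return agents * checks_per_agent / interval_s


# First results: 4,000 agents, 8 checks, 5s interval
print(events_per_second(4_000, 8, 5))  # 6400.0
```

Under the same assumption, 36,000 agents with 38 checks at a 10s interval across 4 subscriptions works out to the 34,200 Events/s reported later in the deck.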
The first results
● Identified several possible bottlenecks
● Identified bugs while under load!
● Began experimentation...
The primary offender
● Sensu Events!
● ~95% of etcd write operations
● Disabled Event persistence - 11,200 Events/s
● etcd max database size (10GB*)
● Needed to move the workload
PostgreSQL hardware
● AMD Threadripper 2920X (12 Cores, 3.5GHz)
● Gigabyte X399 AORUS PRO
● 16GB DDR4 2666MHz CL16 (2x 8GB)
● Two Intel 660p Series M.2 PCIe 512GB SSDs
● Three Intel Gigabit CT PCIe Network Cards
New results with PostgreSQL
Agents: 4,000
Checks: 14 at 5s interval
Events/s: 11,200
Not good enough!
PostgreSQL tuning
● Multi-Version Concurrency Control
● Many updates - need aggressive auto-vacuuming!
vacuum_cost_delay = 10ms
vacuum_cost_limit = 10000
autovacuum_naptime = 10s
autovacuum_vacuum_scale_factor = 0.05
autovacuum_analyze_scale_factor = 0.025
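To make the scale-factor settings concrete: PostgreSQL documents the autovacuum trigger as base threshold + scale factor × number of tuples in the table. The sketch below compares the stock scale factor (0.2) with the 0.05 value from this slide. The base threshold of 50 is PostgreSQL's default for `autovacuum_vacuum_threshold`; the one-million-row table is a made-up figure for illustration and does not come from the talk.

```python
def autovacuum_trigger(reltuples: int, scale_factor: float,
                       base_threshold: int = 50) -> float:
    """Dead-tuple count at which autovacuum vacuums a table.

    Follows the formula in the PostgreSQL documentation:
    threshold = base threshold + scale factor * number of tuples.
    """
    return base_threshold + scale_factor * reltuples


ROWS = 1_000_000  # hypothetical events table size, not from the talk

default_trigger = autovacuum_trigger(ROWS, 0.2)   # PostgreSQL default
tuned_trigger = autovacuum_trigger(ROWS, 0.05)    # value from the slide

print(f"default: vacuum after ~{default_trigger:,.0f} dead tuples")
print(f"tuned:   vacuum after ~{tuned_trigger:,.0f} dead tuples")
```

With a lower scale factor, a frequently updated table is vacuumed after far fewer dead tuples accumulate, which is the "aggressive" behaviour the slide asks for.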
PostgreSQL tuning
● Tune write-ahead logging
● Reduce the number of disk writes
wal_sync_method = fdatasync
wal_writer_delay = 5000ms
max_wal_size = 5GB
min_wal_size = 1GB
A huge bug!
● Burying Check TTL switch set on every Event!
● Additional etcd PUT and DELETE operations
New results with bug fix
Agents: 4,000
Checks: 40 at 5s interval
Events/s: 32,000
Much better! Still not good enough.
Entity and silenced caches
● Several etcd range (read) requests per Event
● Caching reduced etcd range requests by 50%
● No improvement to Event throughput :(
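The slides do not say how the entity and silenced caches were built, so the following is only a generic read-through cache with a time-to-live, shown to illustrate how repeated reads for the same key can be served without touching the store. `ReadThroughCache`, `fake_store_read`, and the 10-second TTL are all hypothetical names and values.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class ReadThroughCache:
    """Serve repeated reads from memory; reload after ttl_s seconds."""

    def __init__(self, loader: Callable[[Hashable], Any], ttl_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader      # function that reads from the store
        self._ttl_s = ttl_s
        self._clock = clock        # injectable so tests can control time
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get(self, key: Hashable) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]        # cache hit: no store read
        value = self._loader(key)  # cache miss or expired: one store read
        self._entries[key] = (now + self._ttl_s, value)
        return value


store_reads = []

def fake_store_read(key):
    """Stand-in for a store range request; records each call."""
    store_reads.append(key)
    return {"name": key}

cache = ReadThroughCache(fake_store_read, ttl_s=10.0)
for _ in range(5):
    cache.get("entity-1")
print(len(store_reads))  # 1
```

A cache like this trades freshness for fewer reads, which matches the slide's observation: read traffic dropped, but throughput did not improve because reads were not the bottleneck.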
Serialization
● Every object is serialized for transport and storage
● Changed from JSON to Protobuf
○ Applied to Agent transport and etcd store
○ Reduced serialized object size!
○ Less CPU time
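As a rough illustration of why a binary encoding tends to be smaller than JSON, the sketch below encodes the same record both ways. It deliberately uses Python's standard `struct` module as a stand-in and is not Protobuf: Protobuf uses tagged fields and variable-length integers rather than a fixed layout. The record fields and sample values are hypothetical and do not reflect the real Sensu Event schema.

```python
import json
import struct

# Hypothetical fixed layout: entity id (uint32), check id (uint32),
# status (uint8), unix timestamp (int64). "<" means little-endian
# with no padding between fields.
RECORD = struct.Struct("<IIBq")


def encode_json(entity_id: int, check_id: int, status: int, timestamp: int) -> bytes:
    """Text encoding: field names are repeated in every message."""
    return json.dumps({
        "entity_id": entity_id,
        "check_id": check_id,
        "status": status,
        "timestamp": timestamp,
    }).encode("utf-8")


def encode_binary(entity_id: int, check_id: int, status: int, timestamp: int) -> bytes:
    """Binary encoding: only the values are written."""
    return RECORD.pack(entity_id, check_id, status, timestamp)


def decode_binary(payload: bytes) -> tuple:
    """Inverse of encode_binary."""
    return RECORD.unpack(payload)


sample = (1234, 56, 0, 1_543_968_000)  # made-up values
print("json bytes:  ", len(encode_json(*sample)))
print("binary bytes:", len(encode_binary(*sample)))
```

The size gap comes mostly from dropping field names and decimal digits, which is also where a schema-based format such as Protobuf saves space and parsing time.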
Internal queues and workers
● Increased Backend internal queue lengths
○ From 100 to 1000 (made configurable)
● Increased Backend internal worker counts
○ From 100 to 1000 (made configurable)
● Increases concurrency and absorbs latency spikes
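The general pattern behind this slide, a bounded queue drained by a pool of workers, can be sketched as follows. This is not the Sensu Backend implementation (which is written in Go); `process_events`, `queue_length`, and `worker_count` are illustrative names, and the handler is assumed not to raise.

```python
import queue
import threading
from typing import Any, Callable, Iterable, List

_STOP = object()  # sentinel telling a worker to exit


def process_events(events: Iterable[Any], handler: Callable[[Any], Any],
                   queue_length: int = 1000, worker_count: int = 8) -> List[Any]:
    """Push events through a bounded queue to a pool of worker threads.

    A longer queue absorbs bursts; more workers raise concurrency.
    Results are returned in completion order, not input order.
    """
    q: "queue.Queue[Any]" = queue.Queue(maxsize=queue_length)
    results: List[Any] = []
    lock = threading.Lock()

    def worker() -> None:
        while True:
            item = q.get()
            if item is _STOP:
                return
            out = handler(item)  # assumed not to raise in this sketch
            with lock:
                results.append(out)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for event in events:
        q.put(event)      # blocks while the queue is full (backpressure)
    for _ in threads:
        q.put(_STOP)      # one sentinel per worker
    for t in threads:
        t.join()
    return results


doubled = process_events(range(100), lambda n: n * 2, queue_length=10, worker_count=4)
print(len(doubled))  # 100
```

When the queue is full the producer blocks, so a short queue turns every latency spike downstream into a stall upstream; a longer queue gives the workers room to catch up.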
New results
Agents: 36,000
Checks: 38 at 10s interval (4 subscriptions)
Events/s: 34,200
Almost there!!!
New results
Agents: 40,000
Checks: 38 at 10s interval (4 subscriptions)
Events/s: 38,000
The performance project
● https://github.com/sensu/sensu-perf
● Performance tests are reproducible
● Users can test their own deployments!
● Now part of release QA!
What’s next for scaling Sensu?
Multi-site Federation
● 40,000 Agents per cluster
● Run multiple/distributed Sensu Go clusters
● Centralized RBAC policy management
● Centralized visibility via the WebUI
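For rough capacity planning, the 40,000-agents-per-cluster figure can be turned into a cluster count with a ceiling division. This sketch treats that figure as a hard per-cluster cap, which is a simplification: the number was measured with a specific check load and hardware, so real capacity may differ. `clusters_needed` is an illustrative helper, not a Sensu tool.

```python
AGENTS_PER_CLUSTER = 40_000  # figure from the slide, treated here as a cap


def clusters_needed(total_agents: int, per_cluster: int = AGENTS_PER_CLUSTER) -> int:
    """Smallest number of clusters that can hold total_agents."""
    if total_agents < 0 or per_cluster <= 0:
        raise ValueError("total_agents must be >= 0 and per_cluster > 0")
    # Integer ceiling division avoids floating-point rounding.
    return (total_agents + per_cluster - 1) // per_cluster


print(clusters_needed(100_000))  # 3
```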
Deployment architectures
Hardware recommendations*
Backend requirements
● 16 vCPU
● 16GB memory
● Attached NVMe SSD
○ >50MB/s and >5k sustained random IOPS
● Gigabit ethernet (low latency)
PostgreSQL requirements
● 16 vCPU
● 16GB memory
● Attached NVMe SSD
○ >300MB/s and >5k sustained random IOPS
● 10 gigabit ethernet (low latency)
Summary
Questions?

My Hashitalk Indonesia April 2024 Presentation
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 

Keynote: Scaling Sensu Go

  • 1. Scaling Sensu Go By Sean Porter, Co-founder & CTO.
  • 2. Who am I? ● Creator of Sensu ● Co-founder ● CTO ● PorterTech
  • 3. Overview: 1. How we 10X’d performance in 6 months 2. Deployment architectures 3. Hardware recommendations 4. Summary 5. Questions
  • 7. Scale, in terms of: ● Performance ● Organization
  • 9. Scaling Sensu Core (1.X): ● Steep learning curve ● Requires RabbitMQ and Redis expertise ● Capable of scaling*
  • 13. Step 1 - Instrument
  • 14. Step 2 - Test environment: ● Used AWS EC2 ● M5.2xlarge to i3.metal ● Agent session load tool ● Disappointing results (~5k) ● Inconsistent
  • 15. Step 3 - Get serious
  • 16. Spent $10k on gaming hardware.
  • 18. Why bare metal? ● Control ● Consistency ● Capacity
  • 20. Backend hardware: ● AMD Threadripper 2920X (12 Cores, 3.5GHz) ● Gigabyte X399 AORUS PRO ● 16GB DDR4 2666MHz CL16 (2x 8GB) ● Two Intel 660p Series M.2 PCIe 512GB SSDs ● Intel Gigabit CT PCIe Network Card
  • 21. Agents hardware: ● AMD Threadripper 2990WX (32 Cores, 3.0GHz) ● Gigabyte X399 AORUS PRO ● 32GB DDR4 2666MHz CL16 (4x 8GB) ● Intel 660p Series M.2 PCIe 512GB SSD
  • 22. Network hardware: ● Two Ubiquiti UniFi 8 Port 60W Switches ● Separate load tool and data planes
  • 24. The first results: ● Consistently delivered disappointing results! Agents: 4,000; Checks: 8 at 5s interval; Events/s: 6,400 ● Produced data!
  • 25. The first results: ● Identified several possible bottlenecks ● Identified bugs while under load! ● Began experimentation...
  • 26. The primary offender: ● Sensu Events! ● ~95% of etcd write operations ● Disabled Event persistence - 11,200 Events/s ● etcd max database size (10GB*) ● Needed to move the workload
  • 29. PostgreSQL hardware: ● AMD Threadripper 2920X (12 Cores, 3.5GHz) ● Gigabyte X399 AORUS PRO ● 16GB DDR4 2666MHz CL16 (2x 8GB) ● Two Intel 660p Series M.2 PCIe 512GB SSDs ● Three Intel Gigabit CT PCIe Network Cards
  • 30. New results with PostgreSQL: Agents: 4,000; Checks: 14 at 5s interval; Events/s: 11,200. Not good enough!
  • 31. PostgreSQL tuning: ● Multi-Version Concurrency Control ● Many updates - need aggressive auto-vacuuming! Settings: vacuum_cost_delay = 10ms; vacuum_cost_limit = 10000; autovacuum_naptime = 10s; autovacuum_vacuum_scale_factor = 0.05; autovacuum_analyze_scale_factor = 0.025
  • 32. PostgreSQL tuning: ● Tune write-ahead logging ● Reduce the number of disk writes. Settings: wal_sync_method = fdatasync; wal_writer_delay = 5000ms; max_wal_size = 5GB; min_wal_size = 1GB
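
To make the auto-vacuum setting on slide 31 concrete: PostgreSQL vacuums a table once its dead tuples exceed `autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * row count`. The sketch below compares the stock scale factor (0.2) with the 0.05 from the slide. The 1,000,000-row table and the function name `autovacuum_trigger` are illustrative assumptions, not figures from the talk.

```python
def autovacuum_trigger(row_count, scale_factor, base_threshold=50):
    """Dead tuples needed before autovacuum processes a table.

    Mirrors PostgreSQL's documented rule:
    autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * rows.
    base_threshold=50 is PostgreSQL's default autovacuum_vacuum_threshold.
    """
    return base_threshold + scale_factor * row_count


if __name__ == "__main__":
    rows = 1_000_000  # hypothetical table size, not from the slides
    for label, factor in (("default", 0.2), ("slide 31", 0.05)):
        trigger = autovacuum_trigger(rows, factor)
        print(f"{label}: vacuum after ~{trigger:,.0f} dead tuples")
```

A lower scale factor means vacuum runs after fewer updates, which suits a table that is rewritten constantly.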
  • 33. A huge bug! ● Burying Check TTL switch set on every Event! ● Additional etcd PUT and DELETE operations
  • 34. New results with bug fix: Agents: 4,000; Checks: 40 at 5s interval; Events/s: 32,000. Much better! Still not good enough.
  • 35. Entity and silenced caches: ● Several etcd range (read) requests per Event ● Caching reduced etcd range requests by 50% ● No improvement to Event throughput :(
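
The caching idea on slide 35 can be sketched as a read-through cache in front of a slow store. This is an illustration in Python, not Sensu's Go implementation; `CachedStore` and the `fetch` callback are hypothetical names, and a real cache would also need invalidation when entities change.

```python
class CachedStore:
    """Read-through cache: only cache misses reach the backing store."""

    def __init__(self, fetch):
        self._fetch = fetch      # callable that performs the expensive read
        self._cache = {}
        self.backend_reads = 0   # counts reads that actually hit the store

    def get(self, key):
        if key not in self._cache:
            self.backend_reads += 1
            self._cache[key] = self._fetch(key)
        return self._cache[key]


if __name__ == "__main__":
    store = CachedStore(lambda name: {"entity": name})
    # 10 events for each of 3 entities: 30 lookups in total.
    for _ in range(10):
        for name in ("agent-1", "agent-2", "agent-3"):
            store.get(name)
    print("backend reads:", store.backend_reads)
```

As the slide notes, cutting reads this way did not raise Event throughput, which suggests reads were not the limiting factor.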
  • 36. Serialization: ● Every object is serialized for transport and storage ● Changed from JSON to Protobuf ○ Applied to Agent transport and etcd store ○ Reduced serialized object size! ○ Less CPU time
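
Why a schema-based binary format is smaller than JSON can be shown with a small stand-in. The sketch below is not Protobuf: it uses a hand-rolled, length-prefixed encoding from Python's `struct` module, and the event fields are made up for illustration. The point is only that JSON repeats field names and punctuation in every message, while a format with a shared schema does not.

```python
import json
import struct


def encode_json(event):
    """Self-describing text encoding: field names travel with every message."""
    return json.dumps(event).encode("utf-8")


def encode_binary(event):
    """Schema-based stand-in: field order is fixed, so no names are sent.

    Strings get a 1-byte length prefix; status is a 4-byte unsigned int and
    timestamp an 8-byte unsigned int, both in network byte order.
    """
    entity = event["entity"].encode("utf-8")
    check = event["check"].encode("utf-8")
    return (
        struct.pack("!B", len(entity)) + entity
        + struct.pack("!B", len(check)) + check
        + struct.pack("!IQ", event["status"], event["timestamp"])
    )


if __name__ == "__main__":
    # Hypothetical event; real Sensu events carry many more fields.
    event = {"entity": "agent-0001", "check": "check-cpu",
             "status": 0, "timestamp": 1555555555}
    print("json bytes:  ", len(encode_json(event)))
    print("binary bytes:", len(encode_binary(event)))
```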
  • 37. Internal queues and workers: ● Increased Backend internal queue lengths ○ From 100 to 1000 (made configurable) ● Increased Backend internal worker counts ○ From 100 to 1000 (made configurable) ● Increases concurrency and absorbs latency spikes
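
The queue-and-worker pattern from slide 37 can be sketched with a bounded queue feeding a pool of workers. This is a Python analogy under stated assumptions, not Sensu's Go code; `run_pipeline` and its parameters are hypothetical names. A longer queue lets the producer keep going during a latency spike, and more workers raise concurrency.

```python
import queue
import threading


def run_pipeline(events, queue_length, worker_count, handle):
    """Process events through a bounded queue and a fixed worker pool."""
    q = queue.Queue(maxsize=queue_length)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            item = q.get()
            if item is None:          # sentinel: no more work for this worker
                return
            out = handle(item)
            with lock:                # guard the shared results list
                results.append(out)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for event in events:
        q.put(event)                  # blocks when the queue is full (backpressure)
    for _ in threads:
        q.put(None)                   # one sentinel per worker
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    done = run_pipeline(range(5000), queue_length=1000, worker_count=100,
                        handle=lambda e: e * 2)
    print("processed:", len(done))
```

Results arrive in whatever order the workers finish, so callers that need ordering have to sort or tag them.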
  • 38. New results: Agents: 36,000; Checks: 38 at 10s interval (4 subscriptions); Events/s: 34,200. Almost there!!!
  • 40. New results: Agents: 40,000; Checks: 38 at 10s interval (4 subscriptions); Events/s: 38,000
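
The Events/s figures reported on the results slides follow from simple arithmetic: agents times checks, divided by the check interval. For the two runs with 4 subscriptions, the reported numbers match only if each agent executes about a quarter of the 38 checks. That reading is an assumption; the slides do not spell it out. A minimal sketch, with `events_per_second` as a hypothetical helper name:

```python
def events_per_second(agents, checks, interval_s, subscriptions=1):
    """Estimate event throughput for a load test.

    Assumption: with several subscriptions, checks are spread evenly so each
    agent runs checks / subscriptions of them.
    """
    return agents * (checks / subscriptions) / interval_s


if __name__ == "__main__":
    runs = [
        ("first results", 4_000, 8, 5, 1),
        ("with PostgreSQL", 4_000, 14, 5, 1),
        ("with bug fix", 4_000, 40, 5, 1),
        ("slide 38", 36_000, 38, 10, 4),
        ("slide 40", 40_000, 38, 10, 4),
    ]
    for label, agents, checks, interval, subs in runs:
        rate = events_per_second(agents, checks, interval, subs)
        print(f"{label}: {rate:,.0f} events/s")
```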
  • 42. The performance project: ● https://github.com/sensu/sensu-perf ● Performance tests are reproducible ● Users can test their own deployments! ● Now part of release QA!
  • 43. What’s next for scaling Sensu?
  • 44. Multi-site Federation: ● 40,000 Agents per cluster ● Run multiple/distributed Sensu Go clusters ● Centralized RBAC policy management ● Centralized visibility via the WebUI
  • 54. Backend requirements: ● 16 vCPU ● 16GB memory ● Attached NVMe SSD ○ >50MB/s and >5k sustained random IOPS ● Gigabit ethernet (low latency)
  • 55. PostgreSQL requirements: ● 16 vCPU ● 16GB memory ● Attached NVMe SSD ○ >300MB/s and >5k sustained random IOPS ● 10 gigabit ethernet (low latency)