The box.com success story:
migrating 350K Nagios objects to Sensu
Trent Baker, Senior Infrastructure SRE
Box

Agenda
/ What is Box?
/ Infrastructure Overview
/ Nagios: Legacy monitoring
/ Sensu: Next generation monitoring
/ Nagios to Sensu migration
/ The end results
/ What’s next?
/ Questions and Answers

What is Box?
/ A leading Content Collaboration Platform company
/ Redwood City headquarters, with offices in Europe, Asia, and Australia
/ Approximately 1,300 employees
/ Customers: 82,000 enterprise, 11M individual
/ Vision: Build amazing products that power how people work together
Value: Blow our customers' minds!

Infrastructure SRE
Mission: Design and build services in a hybrid infrastructure that are highly available, flexible, scalable, secure, and global.
Services: Authentication, Bastion Infrastructure, Configuration Management, Domain Name Service, Monitoring and Alerting, Provisioning, Repositories
Infrastructure SREs: Ben Parli, Danny McCarthy, David Chan, Guarav Jain, Luke James, Mani Hashemi, Trent Baker, Steve Zerbe

Infrastructure Overview
/ Hybrid Architecture: Bare metal, private cloud, public cloud
/ ~16,000 compute nodes and growing
/ 350K Nagios objects: (hosts, contacts, services) x (groups), notifications, commands
/ Legacy Nagios infrastructure struggled to accommodate growth
/ Sensu’s design accommodates high-growth environments
Value: 10x it!

Nagios: Legacy monitoring
Architecture
/ Single Nagios master, multiple Nagios slaves per datacenter, Nagios Core 3.5.1
/ Single viewable pane via Thruk 1.58
/ Nagios slaves executed active checks
/ Nagios master received passive check results via NSCA
/ Nagios master acted as failover for Nagios slaves
/ Nagios architecture was tightly coupled with Puppet through exported resources
/ Nagios masters processed 350K objects
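
For readers unfamiliar with passive checks: results submitted through NSCA are handed to the send_nsca client as one tab-separated line per service result (host name, service description, return code, plugin output). The following is a minimal sketch of that line format only. The function name, host, and service are hypothetical and are not taken from the Box configuration.

```python
def nsca_service_line(host, service, code, output):
    """Format one passive service check result as a send_nsca input line."""
    # Return codes follow the Nagios plugin convention: 0..3.
    if code not in (0, 1, 2, 3):
        raise ValueError(f"invalid return code: {code}")
    # Tabs and newlines delimit fields and records, so strip them from free text.
    clean = output.replace("\t", " ").replace("\n", " ")
    return f"{host}\t{service}\t{code}\t{clean}\n"


if __name__ == "__main__":
    # Hypothetical host and service names, for illustration only.
    line = nsca_service_line("web01.example.com", "disk_root", 1,
                             "DISK WARNING - 85% used")
    print(line, end="")
```

In a setup like the one described, a line of this shape would be piped to send_nsca on the submitting side and processed on the master as a passive result.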

Nagios: Legacy monitoring
Limitations
/ Nagios masters and slaves were single points of failure
/ Nagios masters were not horizontally scalable
/ Tight coupling with Puppet using exported resources caused environment convergence delays
/ Adding or removing hosts, hostgroups, and checks was complicated and took hours
/ Routine maintenance caused alert storms and loss of telemetry
/ Simple changes caused configuration errors

Nagios: Legacy monitoring
Work-arounds and attempted solutions
/ Tried to upgrade Nagios and Thruk
/ Split Nagios cluster into server and network
/ Tried to implement other Nagios distributed solutions: Mod-Gearman
/ Offloaded metrics checks to time series monitoring service
/ Realized there was no path forward with Nagios

Sensu: Next Generation Monitoring
Path Forward
/ Needed to green-field the next-generation monitoring solution
/ Evaluated 6 to 8 monitoring solutions
/ Sensu met a long list of ~75 requirements
/ Monitor entire hybrid environment
/ No single points of failure
/ Easily deployed, scaled, and maintained
/ Integrate with PagerDuty, Slack, Email, Jira, Puppet, LDAP, SAML
/ Use existing Nagios plugins
/ Proof of Concept validated Sensu selection
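
The "use existing Nagios plugins" requirement is feasible because Sensu checks follow the same plugin convention as Nagios: a check prints a line of output and signals its state through the exit status (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). Below is a minimal sketch of that convention. The check label and thresholds are hypothetical, and a real plugin would finish by passing the status to sys.exit.

```python
STATUS_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Map a measured value to a Nagios-style exit status."""
    if value is None:
        return 3  # no measurement available: UNKNOWN
    if value >= crit:
        return 2
    if value >= warn:
        return 1
    return 0


def run_check(label, value, warn, crit):
    """Return (exit_status, output_line) the way a plugin would report them."""
    status = evaluate(value, warn, crit)
    return status, f"{label} {STATUS_NAMES[status]} - value={value}"


if __name__ == "__main__":
    # Hypothetical disk-usage reading of 91% with 80/90 thresholds.
    status, message = run_check("DISK", 91.0, warn=80.0, crit=90.0)
    print(message)
    # A real plugin would now call sys.exit(status).
```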

Sensu: Next Generation Monitoring
Architecture

Sensu: Next Generation Monitoring
Implementation
/ Greenfield needed some work
/ Moved custom redis and rabbitmq modules to box_redis and box_rabbitmq
/ Upgraded Puppet modules: stdlib, apt
/ Upgraded Erlang and OpenJDK
/ Deployed standard Forge modules for Sensu, RabbitMQ, and Redis
/ Created Puppet node definition, roles, and profiles
/ Established easy-to-use frameworks with Puppet profiles and Hiera

Sensu: Next Generation Monitoring
Implementation: Puppetfile and node definition
Puppetfile Puppet Nodes Definition

Sensu: Next Generation Monitoring
Implementation: roles and profiles
Roles Puppet Profiles

Sensu: Next Generation Monitoring
Implementation: hiera and contacts.pp
Hiera common.json
Profile Contacts.pp

Nagios to Sensu Migration
Nagios to Sensu mapping
nagios::client::add_to_hostgroup -> sensu::subscription
nagios::magic::add_to_hostgroup -> sensu::subscription
nagios::object::{service|hostgroup|command} -> sensu::check
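
To make the mapping concrete, here is a rough sketch of the idea in Python rather than the Puppet code used in the talk: a Nagios hostgroup becomes a Sensu subscription, and a service plus its command becomes a Sensu check definition. The input dict and its keys are hypothetical. The output follows the Sensu 1.x JSON check definition layout (command, subscribers, interval), and the interval conversion assumes the default Nagios interval_length of 60 seconds.

```python
import json


def nagios_service_to_sensu_check(service, interval_length=60):
    """Translate a simplified Nagios service object into a Sensu check.

    `service` is a hypothetical dict with the keys service_description,
    check_command (already expanded to a full command line), hostgroups,
    and check_interval (in Nagios time units).
    """
    name = service["service_description"].lower().replace(" ", "_")
    return {
        "checks": {
            name: {
                "command": service["check_command"],
                # Nagios hostgroups become Sensu subscriptions.
                "subscribers": sorted(service["hostgroups"]),
                # Nagios counts in time units; Sensu intervals are in seconds.
                "interval": int(service["check_interval"] * interval_length),
            }
        }
    }


if __name__ == "__main__":
    # Hypothetical service definition, for illustration only.
    example = {
        "service_description": "Disk Root",
        "check_command": "/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /",
        "hostgroups": ["webservers"],
        "check_interval": 5,
    }
    print(json.dumps(nagios_service_to_sensu_check(example), indent=2))
```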

Nagios to Sensu Migration
Hostgroup to check mapping

Nagios to Sensu Migration
Hostgroup to aggregate check mapping

Nagios to Sensu Migration
Hostgroup to aggregate check mapping (continued)

Nagios to Sensu Migration
Sensu aggregate check API configuration

Nagios to Sensu Migration
Sensu aggregate subscription
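
The slide contents for the aggregate mapping are not reproduced in this transcript. As a general illustration of the idea, an aggregate check alerts on the share of failing results across a group instead of on every host. The sketch below evaluates a results summary assumed to have the shape of the "results" object returned by the Sensu 1.x Aggregates API (ok, warning, critical, unknown, total, stale). The thresholds and function name are hypothetical and are not Box's configuration.

```python
def aggregate_status(results, warn_pct=10.0, crit_pct=25.0):
    """Turn aggregate result counts into a (status, message) pair.

    Status follows the Nagios/Sensu exit-code convention:
    0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
    """
    total = results.get("total", 0)
    if total == 0:
        return 3, "UNKNOWN - aggregate has no results"
    failing = total - results.get("ok", 0)
    pct = 100.0 * failing / total
    if pct >= crit_pct:
        status, name = 2, "CRITICAL"
    elif pct >= warn_pct:
        status, name = 1, "WARNING"
    else:
        status, name = 0, "OK"
    return status, f"{name} - {failing}/{total} results not OK ({pct:.1f}%)"


if __name__ == "__main__":
    # Sample counts in the assumed response shape, for illustration only.
    sample = {"ok": 18, "warning": 0, "critical": 1,
              "unknown": 0, "total": 19, "stale": 0}
    print(aggregate_status(sample))
```

With percentage thresholds, a single failing host in a large group stays below the alerting threshold, while a failure affecting a large share of the group still raises an alert.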

The end results
Positioned for growth
/ 16,000+ hosts registered
/ ~1,250 checks migrated
/ Deployment improvements
/ Scaling improvements
/ Availability improvements
/ Administration improvements
/ Productivity increases

What’s Next?
Infrastructure SRE future projects
/ Network Nagios to Sensu migration
/ Wavefront integration
/ Filter utilization
/ Single sign-on implementation
/ Auto-remediation: StackStorm
/ Sensu 2.0 evaluation
MAKE MOM PROUD.
Thank You!
Questions?
Trent Baker
tbaker@box.com

More Related Content

Similar to The Box.com success story: migrating 350K Nagios objects to Sensu

Zenko @Cloud Native Foundation London Meetup March 6th 2018
Zenko @Cloud Native Foundation London Meetup March 6th 2018Zenko @Cloud Native Foundation London Meetup March 6th 2018
Zenko @Cloud Native Foundation London Meetup March 6th 2018Laure Vergeron
 
2017 Hackathon Scality & 42 School
2017 Hackathon Scality & 42 School2017 Hackathon Scality & 42 School
2017 Hackathon Scality & 42 SchoolScality
 
Ridwan Fadjar Septian PyCon ID 2021 Regular Talk - django application monitor...
Ridwan Fadjar Septian PyCon ID 2021 Regular Talk - django application monitor...Ridwan Fadjar Septian PyCon ID 2021 Regular Talk - django application monitor...
Ridwan Fadjar Septian PyCon ID 2021 Regular Talk - django application monitor...Ridwan Fadjar
 
CloudNativeLondon 2017: "What is a Service Mesh, and Do I Need One when Devel...
CloudNativeLondon 2017: "What is a Service Mesh, and Do I Need One when Devel...CloudNativeLondon 2017: "What is a Service Mesh, and Do I Need One when Devel...
CloudNativeLondon 2017: "What is a Service Mesh, and Do I Need One when Devel...Daniel Bryant
 
Marcus Rochelle - Landis+Gyr - Monitoring with Nagios Enterprise Edition
Marcus Rochelle - Landis+Gyr - Monitoring with Nagios Enterprise EditionMarcus Rochelle - Landis+Gyr - Monitoring with Nagios Enterprise Edition
Marcus Rochelle - Landis+Gyr - Monitoring with Nagios Enterprise EditionNagios
 
Webex APIs for Admins - Cisco Live Orlando 2018 - DEVNET-3610
Webex APIs for Admins - Cisco Live Orlando 2018 - DEVNET-3610Webex APIs for Admins - Cisco Live Orlando 2018 - DEVNET-3610
Webex APIs for Admins - Cisco Live Orlando 2018 - DEVNET-3610Cisco DevNet
 
CENGN - OpenStack MeetUp - March 2017
CENGN - OpenStack MeetUp - March 2017CENGN - OpenStack MeetUp - March 2017
CENGN - OpenStack MeetUp - March 2017Stacy Véronneau
 
Alex Pshul: What We Learned by Testing Execution of 300K Messages/Min in a Se...
Alex Pshul: What We Learned by Testing Execution of 300K Messages/Min in a Se...Alex Pshul: What We Learned by Testing Execution of 300K Messages/Min in a Se...
Alex Pshul: What We Learned by Testing Execution of 300K Messages/Min in a Se...CodeValue
 
Drupal 8 and 9, Backwards Compatibility, and Drupal 8.5 update
Drupal 8 and 9, Backwards Compatibility, and Drupal 8.5 updateDrupal 8 and 9, Backwards Compatibility, and Drupal 8.5 update
Drupal 8 and 9, Backwards Compatibility, and Drupal 8.5 updateAngela Byron
 
Laying the Foundation for Ionic Platform Insights on Spark
Laying the Foundation for Ionic Platform Insights on SparkLaying the Foundation for Ionic Platform Insights on Spark
Laying the Foundation for Ionic Platform Insights on SparkIonic Security
 
Drupal 9 and Backwards Compatibility: Why now is the time to upgrade to Drupal 8
Drupal 9 and Backwards Compatibility: Why now is the time to upgrade to Drupal 8Drupal 9 and Backwards Compatibility: Why now is the time to upgrade to Drupal 8
Drupal 9 and Backwards Compatibility: Why now is the time to upgrade to Drupal 8Angela Byron
 
Event sourcing and CQRS: Lessons from the trenches
Event sourcing and CQRS: Lessons from the trenchesEvent sourcing and CQRS: Lessons from the trenches
Event sourcing and CQRS: Lessons from the trenchesDavid Jiménez Martínez
 
Docker Meetup Tokyo #23 - Zenko Open Source Multi-Cloud Data Controller - Lau...
Docker Meetup Tokyo #23 - Zenko Open Source Multi-Cloud Data Controller - Lau...Docker Meetup Tokyo #23 - Zenko Open Source Multi-Cloud Data Controller - Lau...
Docker Meetup Tokyo #23 - Zenko Open Source Multi-Cloud Data Controller - Lau...Laure Vergeron
 
OCTO On-Site Off-Site Update on D8 Roadmap
OCTO On-Site Off-Site Update on D8 RoadmapOCTO On-Site Off-Site Update on D8 Roadmap
OCTO On-Site Off-Site Update on D8 RoadmapAngela Byron
 
AWS November meetup Slides
AWS November meetup SlidesAWS November meetup Slides
AWS November meetup SlidesJacksonMorgan9
 
IoT and Cloud services interactions
IoT and Cloud services interactionsIoT and Cloud services interactions
IoT and Cloud services interactionsAGILE IoT
 
Denver Cloud Foundry Meetup - February 2016
Denver Cloud Foundry Meetup - February 2016Denver Cloud Foundry Meetup - February 2016
Denver Cloud Foundry Meetup - February 2016Josh Ghiloni
 
Use GitLab with Chaos Engineering to Harden your Applications + OpenEBS 1.3 ...
 Use GitLab with Chaos Engineering to Harden your Applications + OpenEBS 1.3 ... Use GitLab with Chaos Engineering to Harden your Applications + OpenEBS 1.3 ...
Use GitLab with Chaos Engineering to Harden your Applications + OpenEBS 1.3 ...MayaData Inc
 
Zenko & MetalK8s @ Dublin Docker Meetup, June 2018
Zenko & MetalK8s @ Dublin Docker Meetup, June 2018Zenko & MetalK8s @ Dublin Docker Meetup, June 2018
Zenko & MetalK8s @ Dublin Docker Meetup, June 2018Laure Vergeron
 

Similar to The Box.com success story: migrating 350K Nagios objects to Sensu (20)

Zenko @Cloud Native Foundation London Meetup March 6th 2018
Zenko @Cloud Native Foundation London Meetup March 6th 2018Zenko @Cloud Native Foundation London Meetup March 6th 2018
Zenko @Cloud Native Foundation London Meetup March 6th 2018
 
2017 Hackathon Scality & 42 School
2017 Hackathon Scality & 42 School2017 Hackathon Scality & 42 School
2017 Hackathon Scality & 42 School
 
Ridwan Fadjar Septian PyCon ID 2021 Regular Talk - django application monitor...
Ridwan Fadjar Septian PyCon ID 2021 Regular Talk - django application monitor...Ridwan Fadjar Septian PyCon ID 2021 Regular Talk - django application monitor...
Ridwan Fadjar Septian PyCon ID 2021 Regular Talk - django application monitor...
 
CloudNativeLondon 2017: "What is a Service Mesh, and Do I Need One when Devel...
CloudNativeLondon 2017: "What is a Service Mesh, and Do I Need One when Devel...CloudNativeLondon 2017: "What is a Service Mesh, and Do I Need One when Devel...
CloudNativeLondon 2017: "What is a Service Mesh, and Do I Need One when Devel...
 
Marcus Rochelle - Landis+Gyr - Monitoring with Nagios Enterprise Edition
Marcus Rochelle - Landis+Gyr - Monitoring with Nagios Enterprise EditionMarcus Rochelle - Landis+Gyr - Monitoring with Nagios Enterprise Edition
Marcus Rochelle - Landis+Gyr - Monitoring with Nagios Enterprise Edition
 
Webex APIs for Admins - Cisco Live Orlando 2018 - DEVNET-3610
Webex APIs for Admins - Cisco Live Orlando 2018 - DEVNET-3610Webex APIs for Admins - Cisco Live Orlando 2018 - DEVNET-3610
Webex APIs for Admins - Cisco Live Orlando 2018 - DEVNET-3610
 
CENGN - OpenStack MeetUp - March 2017
CENGN - OpenStack MeetUp - March 2017CENGN - OpenStack MeetUp - March 2017
CENGN - OpenStack MeetUp - March 2017
 
Alex Pshul: What We Learned by Testing Execution of 300K Messages/Min in a Se...
Alex Pshul: What We Learned by Testing Execution of 300K Messages/Min in a Se...Alex Pshul: What We Learned by Testing Execution of 300K Messages/Min in a Se...
Alex Pshul: What We Learned by Testing Execution of 300K Messages/Min in a Se...
 
Drupal 8 and 9, Backwards Compatibility, and Drupal 8.5 update
Drupal 8 and 9, Backwards Compatibility, and Drupal 8.5 updateDrupal 8 and 9, Backwards Compatibility, and Drupal 8.5 update
Drupal 8 and 9, Backwards Compatibility, and Drupal 8.5 update
 
Laying the Foundation for Ionic Platform Insights on Spark
Laying the Foundation for Ionic Platform Insights on SparkLaying the Foundation for Ionic Platform Insights on Spark
Laying the Foundation for Ionic Platform Insights on Spark
 
Drupal 9 and Backwards Compatibility: Why now is the time to upgrade to Drupal 8
Drupal 9 and Backwards Compatibility: Why now is the time to upgrade to Drupal 8Drupal 9 and Backwards Compatibility: Why now is the time to upgrade to Drupal 8
Drupal 9 and Backwards Compatibility: Why now is the time to upgrade to Drupal 8
 
Event sourcing and CQRS: Lessons from the trenches
Event sourcing and CQRS: Lessons from the trenchesEvent sourcing and CQRS: Lessons from the trenches
Event sourcing and CQRS: Lessons from the trenches
 
Docker Meetup Tokyo #23 - Zenko Open Source Multi-Cloud Data Controller - Lau...
Docker Meetup Tokyo #23 - Zenko Open Source Multi-Cloud Data Controller - Lau...Docker Meetup Tokyo #23 - Zenko Open Source Multi-Cloud Data Controller - Lau...
Docker Meetup Tokyo #23 - Zenko Open Source Multi-Cloud Data Controller - Lau...
 
OCTO On-Site Off-Site Update on D8 Roadmap
OCTO On-Site Off-Site Update on D8 RoadmapOCTO On-Site Off-Site Update on D8 Roadmap
OCTO On-Site Off-Site Update on D8 Roadmap
 
AWS November meetup Slides
AWS November meetup SlidesAWS November meetup Slides
AWS November meetup Slides
 
AWS User Group November
AWS User Group NovemberAWS User Group November
AWS User Group November
 
IoT and Cloud services interactions
IoT and Cloud services interactionsIoT and Cloud services interactions
IoT and Cloud services interactions
 
Denver Cloud Foundry Meetup - February 2016
Denver Cloud Foundry Meetup - February 2016Denver Cloud Foundry Meetup - February 2016
Denver Cloud Foundry Meetup - February 2016
 
Use GitLab with Chaos Engineering to Harden your Applications + OpenEBS 1.3 ...
 Use GitLab with Chaos Engineering to Harden your Applications + OpenEBS 1.3 ... Use GitLab with Chaos Engineering to Harden your Applications + OpenEBS 1.3 ...
Use GitLab with Chaos Engineering to Harden your Applications + OpenEBS 1.3 ...
 
Zenko & MetalK8s @ Dublin Docker Meetup, June 2018
Zenko & MetalK8s @ Dublin Docker Meetup, June 2018Zenko & MetalK8s @ Dublin Docker Meetup, June 2018
Zenko & MetalK8s @ Dublin Docker Meetup, June 2018
 

More from Sensu Inc.

Introducing GoAlert: a brand-new on-call scheduling and notification open sou...
Introducing GoAlert: a brand-new on-call scheduling and notification open sou...Introducing GoAlert: a brand-new on-call scheduling and notification open sou...
Introducing GoAlert: a brand-new on-call scheduling and notification open sou...Sensu Inc.
 
Monitoring Graceful Failure
Monitoring Graceful FailureMonitoring Graceful Failure
Monitoring Graceful FailureSensu Inc.
 
The Bonsai Asset Index : A new way for the community to share resources
The Bonsai Asset Index : A new way for the community to share resourcesThe Bonsai Asset Index : A new way for the community to share resources
The Bonsai Asset Index : A new way for the community to share resourcesSensu Inc.
 
PPB's Sensu Journey
PPB's Sensu JourneyPPB's Sensu Journey
PPB's Sensu JourneySensu Inc.
 
Testing and monitoring and broken things
Testing and monitoring and broken thingsTesting and monitoring and broken things
Testing and monitoring and broken thingsSensu Inc.
 
Order from chaos: automating monitoring configuration
Order from chaos: automating monitoring configurationOrder from chaos: automating monitoring configuration
Order from chaos: automating monitoring configurationSensu Inc.
 
Keynote: Measuring the right things
Keynote: Measuring the right thingsKeynote: Measuring the right things
Keynote: Measuring the right thingsSensu Inc.
 
Keynote: Scaling Sensu Go
Keynote: Scaling Sensu GoKeynote: Scaling Sensu Go
Keynote: Scaling Sensu GoSensu Inc.
 
Keynote: Sensu as a multi-cloud monitoring control plane
Keynote: Sensu as a multi-cloud monitoring control planeKeynote: Sensu as a multi-cloud monitoring control plane
Keynote: Sensu as a multi-cloud monitoring control planeSensu Inc.
 
AIOps & Observability to Lead Your Digital Transformation
AIOps & Observability to Lead Your Digital TransformationAIOps & Observability to Lead Your Digital Transformation
AIOps & Observability to Lead Your Digital TransformationSensu Inc.
 
Ecosystem session: Sensu + Puppet
Ecosystem session: Sensu + PuppetEcosystem session: Sensu + Puppet
Ecosystem session: Sensu + PuppetSensu Inc.
 
Herding cats & catching fire: Workday's telemetry & middleware
Herding cats & catching fire: Workday's telemetry & middlewareHerding cats & catching fire: Workday's telemetry & middleware
Herding cats & catching fire: Workday's telemetry & middlewareSensu Inc.
 
7 Years of Sensu: Then, Now, and Soon
7 Years of Sensu: Then, Now, and Soon7 Years of Sensu: Then, Now, and Soon
7 Years of Sensu: Then, Now, and SoonSensu Inc.
 
Pull, don’t push: Architectures for monitoring and configuration in a microse...
Pull, don’t push: Architectures for monitoring and configuration in a microse...Pull, don’t push: Architectures for monitoring and configuration in a microse...
Pull, don’t push: Architectures for monitoring and configuration in a microse...Sensu Inc.
 
Assets in Sensu 2.0
Assets in Sensu 2.0Assets in Sensu 2.0
Assets in Sensu 2.0Sensu Inc.
 
Project 3M: Meaningful Monitoring and Messaging
Project 3M: Meaningful Monitoring and MessagingProject 3M: Meaningful Monitoring and Messaging
Project 3M: Meaningful Monitoring and MessagingSensu Inc.
 
Sharing Sensu with Multiple Teams using Ansible
Sharing Sensu with Multiple Teams using AnsibleSharing Sensu with Multiple Teams using Ansible
Sharing Sensu with Multiple Teams using AnsibleSensu Inc.
 
Where's My Beer: Building a Better Kegerator with a Raspberry Pi & Sensu
Where's My Beer: Building a Better Kegerator with a Raspberry Pi & SensuWhere's My Beer: Building a Better Kegerator with a Raspberry Pi & Sensu
Where's My Beer: Building a Better Kegerator with a Raspberry Pi & SensuSensu Inc.
 
Reimagining Sensu
Reimagining SensuReimagining Sensu
Reimagining SensuSensu Inc.
 
Alert Fatigue: Avoidance and Course Correction
Alert Fatigue: Avoidance and Course CorrectionAlert Fatigue: Avoidance and Course Correction
Alert Fatigue: Avoidance and Course CorrectionSensu Inc.
 

More from Sensu Inc. (20)

Introducing GoAlert: a brand-new on-call scheduling and notification open sou...
Introducing GoAlert: a brand-new on-call scheduling and notification open sou...Introducing GoAlert: a brand-new on-call scheduling and notification open sou...
Introducing GoAlert: a brand-new on-call scheduling and notification open sou...
 
Monitoring Graceful Failure
Monitoring Graceful FailureMonitoring Graceful Failure
Monitoring Graceful Failure
 
The Bonsai Asset Index : A new way for the community to share resources
The Bonsai Asset Index : A new way for the community to share resourcesThe Bonsai Asset Index : A new way for the community to share resources
The Bonsai Asset Index : A new way for the community to share resources
 
PPB's Sensu Journey
PPB's Sensu JourneyPPB's Sensu Journey
PPB's Sensu Journey
 
Testing and monitoring and broken things
Testing and monitoring and broken thingsTesting and monitoring and broken things
Testing and monitoring and broken things
 
Order from chaos: automating monitoring configuration
Order from chaos: automating monitoring configurationOrder from chaos: automating monitoring configuration
Order from chaos: automating monitoring configuration
 
Keynote: Measuring the right things
Keynote: Measuring the right thingsKeynote: Measuring the right things
Keynote: Measuring the right things
 
Keynote: Scaling Sensu Go
Keynote: Scaling Sensu GoKeynote: Scaling Sensu Go
Keynote: Scaling Sensu Go
 
Keynote: Sensu as a multi-cloud monitoring control plane
Keynote: Sensu as a multi-cloud monitoring control planeKeynote: Sensu as a multi-cloud monitoring control plane
Keynote: Sensu as a multi-cloud monitoring control plane
 
AIOps & Observability to Lead Your Digital Transformation
AIOps & Observability to Lead Your Digital TransformationAIOps & Observability to Lead Your Digital Transformation
AIOps & Observability to Lead Your Digital Transformation
 
Ecosystem session: Sensu + Puppet
Ecosystem session: Sensu + PuppetEcosystem session: Sensu + Puppet
Ecosystem session: Sensu + Puppet
 
Herding cats & catching fire: Workday's telemetry & middleware
Herding cats & catching fire: Workday's telemetry & middlewareHerding cats & catching fire: Workday's telemetry & middleware
Herding cats & catching fire: Workday's telemetry & middleware
 
7 Years of Sensu: Then, Now, and Soon
7 Years of Sensu: Then, Now, and Soon7 Years of Sensu: Then, Now, and Soon
7 Years of Sensu: Then, Now, and Soon
 
Pull, don’t push: Architectures for monitoring and configuration in a microse...
Pull, don’t push: Architectures for monitoring and configuration in a microse...Pull, don’t push: Architectures for monitoring and configuration in a microse...
Pull, don’t push: Architectures for monitoring and configuration in a microse...
 
Assets in Sensu 2.0
Assets in Sensu 2.0Assets in Sensu 2.0
Assets in Sensu 2.0
 
Project 3M: Meaningful Monitoring and Messaging
Project 3M: Meaningful Monitoring and MessagingProject 3M: Meaningful Monitoring and Messaging
Project 3M: Meaningful Monitoring and Messaging
 
Sharing Sensu with Multiple Teams using Ansible
Sharing Sensu with Multiple Teams using AnsibleSharing Sensu with Multiple Teams using Ansible
Sharing Sensu with Multiple Teams using Ansible
 
Where's My Beer: Building a Better Kegerator with a Raspberry Pi & Sensu
Where's My Beer: Building a Better Kegerator with a Raspberry Pi & SensuWhere's My Beer: Building a Better Kegerator with a Raspberry Pi & Sensu
Where's My Beer: Building a Better Kegerator with a Raspberry Pi & Sensu
 
Reimagining Sensu
Reimagining SensuReimagining Sensu
Reimagining Sensu
 
Alert Fatigue: Avoidance and Course Correction
Alert Fatigue: Avoidance and Course CorrectionAlert Fatigue: Avoidance and Course Correction
Alert Fatigue: Avoidance and Course Correction
 

Recently uploaded

Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 

Recently uploaded (20)

Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 

The Box.com success story: migrating 350K Nagios objects to Sensu

  • 1. The box.com success story: migrating 350K Nagios objects to Sensu Trent Baker, Senior Infrastructure SRE
  • 2. Trent Baker, Senior Infrastructure SRE, Box
  • 3. Agenda / What is Box? / Infrastructure Overview / Nagios: Legacy monitoring / Sensu: Next generation monitoring / Nagios to Sensu migration / The end results / What’s next? / Questions and Answers
  • 4. What is Box? / Content Collaboration Platform company leader / Redwood City headquarters, with offices in Europe, Asia, and Australia / Approximately 1,300 employees / Customers: 82,000 enterprise, 11M individual / Vision: Build amazing products that power how people work together / Value: Blow our customers' minds!
  • 5. Infrastructure SRE / Mission: Design and build services in a hybrid infrastructure that are highly available, flexible, scalable, secure, and global. / Services: Authentication, Bastion Infrastructure, Configuration Management, Domain Name Service, Monitoring and Alerting, Provisioning, Repositories / Infrastructure SREs: Ben Parli, Danny McCarthy, David Chan, Guarav Jain, Luke James, Mani Hashemi, Trent Baker, Steve Zerbe
  • 6. Infrastructure Overview / Hybrid architecture: bare metal, private cloud, public cloud / ~16,000 compute nodes and growing / 350K Nagios objects: (hosts, contacts, services) x (groups), notifications, commands / Legacy Nagios infrastructure struggled to accommodate growth / Sensu’s design accommodates high-growth environments / Value: 10x it!
  • 7. Nagios: Legacy monitoring (Architecture) / Single Nagios master, multiple Nagios slaves per datacenter, Nagios Core 3.5.1 / Single viewable pane via Thruk 1.58 / Nagios slaves executed active checks / Nagios master received passive checks via NSCA / Nagios master acted as failover for Nagios slaves / Nagios architecture was tightly coupled with Puppet through exported resources / Nagios masters processed 350K objects
  • 8. Nagios: Legacy monitoring (Limitations) / Nagios masters and slaves were single points of failure / Nagios masters were not horizontally scalable / Tight coupling with Puppet using exported resources caused environment convergence delays / Adding or removing hosts, hostgroups, and checks was complicated and took hours / Routine maintenance caused alert storms and loss of telemetry / Simple changes caused configuration errors
  • 9. Nagios: Legacy monitoring (Work-arounds and attempted solutions) / Tried to upgrade Nagios and Thruk / Split Nagios cluster into server and network / Tried to implement other Nagios distributed solutions: Mod-Gearman / Offloaded metrics checks to time series monitoring service / Realized there was no path forward with Nagios
  • 10. Sensu: Next Generation Monitoring (Path Forward) / Needed to green-field the next-generation monitoring solution / Evaluated 6 to 8 monitoring solutions / Sensu met a long list of ~75 requirements / Monitor entire hybrid environment / No single points of failure / Easily deployed, scaled, and maintained / Integrate with PagerDuty, Slack, Email, Jira, Puppet, LDAP, SAML / Use existing Nagios plugins / Proof of Concept validated Sensu selection
  • 11. Sensu: Next Generation Monitoring (Architecture)
  • 12. Sensu: Next Generation Monitoring (Implementation) / Greenfield needed some work / Moved custom Redis and RabbitMQ modules to box_redis and box_rabbitmq / Upgraded Puppet modules: stdlib, apt / Upgraded Erlang and OpenJDK / Deployed standard Forge modules for Sensu, RabbitMQ, and Redis / Created Puppet node definition, roles, and profiles / Established easy-to-use frameworks with Puppet profiles and Hiera
  • 13. Sensu: Next Generation Monitoring (Implementation: Puppetfile and node definition) / Puppetfile / Puppet node definition
  • 14. Sensu: Next Generation Monitoring (Implementation: roles and profiles) / Roles / Puppet profiles
  • 15. Sensu: Next Generation Monitoring (Implementation: Hiera and contacts.pp) / Hiera common.json / Profile contacts.pp
  • 16. Nagios to Sensu Migration (Nagios to Sensu mapping) / nagios::client::add_to_hostgroup → sensu::subscription / nagios::magic::add_to_hostgroup → sensu::subscription / Nagios::object::{service|hostgroup|command} → sensu::check
  • 17. Nagios to Sensu Migration (Hostgroup to check mapping)
  • 18. Nagios to Sensu Migration (Hostgroup to aggregate check mapping)
  • 19. Nagios to Sensu Migration (Hostgroup to aggregate check mapping, continued)
  • 20. Nagios to Sensu Migration (Sensu aggregate check API configuration)
  • 21. Nagios to Sensu Migration (Sensu aggregate subscription)
  • 22. The end results (Positioned for growth) / 16,000+ hosts registered / ~1,250 checks migrated / Deployment improvements / Scaling improvements / Availability improvements / Administration improvements / Productivity increases
  • 23. What’s Next? (Infrastructure SRE future projects) / Network Nagios to Sensu migration / Wavefront integration / Filter utilization / Single sign-on implementation / Auto-remediation: StackStorm / Sensu 2.0 evaluation
  • 24. MAKE MOM PROUD.
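
The Nagios to Sensu mapping listed on slide 16 can be read as a simple lookup from legacy Puppet resource types to their Sensu replacements. The following is a minimal sketch of that lookup in Python, not the tooling Box used: the table entries come from the slide, while the `translate` and `summarize` helpers, the lower-casing of type names, and the expansion of `{service|hostgroup|command}` into three keys are assumptions made for illustration.

```python
# Lookup table taken from slide 16. The {service|hostgroup|command}
# shorthand on the slide is expanded into three separate keys (assumption).
NAGIOS_TO_SENSU = {
    "nagios::client::add_to_hostgroup": "sensu::subscription",
    "nagios::magic::add_to_hostgroup": "sensu::subscription",
    "nagios::object::service": "sensu::check",
    "nagios::object::hostgroup": "sensu::check",
    "nagios::object::command": "sensu::check",
}


def translate(resource_type: str) -> str:
    """Return the Sensu Puppet type that replaces a Nagios Puppet type.

    Hypothetical helper: names are lower-cased before lookup so that
    'Nagios::object::service' and 'nagios::object::service' match.
    """
    key = resource_type.lower()
    if key not in NAGIOS_TO_SENSU:
        raise ValueError(f"no Sensu mapping for {resource_type!r}")
    return NAGIOS_TO_SENSU[key]


def summarize(resource_types):
    """Count how many Nagios resources land on each Sensu type."""
    counts = {}
    for resource_type in resource_types:
        target = translate(resource_type)
        counts[target] = counts.get(target, 0) + 1
    return counts
```

Raising on an unmapped type, rather than silently skipping it, is a design choice for the sketch: during a migration of this size, an unmapped resource is something an operator would likely want surfaced.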