
The Box.com success story: migrating 350K Nagios objects to Sensu


More than 41 million users and 74,000 businesses, including 59% of the Fortune 500, trust Box to manage content in the cloud. Box was monitoring this web-scale infrastructure with Nagios and could not keep up with the rapid pace of change inside the company. In this talk from Sensu Summit 2018, Trent Baker, Senior Infrastructure Site Reliability Engineer at Box, Inc., tells the story of their migration, from wrestling with the management of 350K objects in Nagios (including over 130K checks) to shutting down the last Nagios host roughly a year later.

Published in: Technology


  1. The box.com success story: migrating 350K Nagios objects to Sensu. Trent Baker, Senior Infrastructure SRE
  2. The box.com success story: migrating 350K Nagios objects to Sensu. Trent Baker, Senior Infrastructure SRE, Box
  3. Agenda
     • What is Box?
     • Infrastructure Overview
     • Nagios: Legacy monitoring
     • Sensu: Next generation monitoring
     • Nagios to Sensu migration
     • The end results
     • What’s next?
     • Questions and Answers
  4. What is Box?
     • Content Collaboration Platform company leader
     • Redwood City headquarters, with offices in Europe, Asia, and Australia
     • Approximately 1,300 employees
     • Customers: 82,000 enterprise, 11M individual
     • Vision: Build amazing products that power how people work together
     • Value: Blow our customers’ minds!
  5. Infrastructure SRE Services
     • Infrastructure SRE mission: Design and build services in a hybrid infrastructure that are highly available, flexible, scalable, secure, and global.
     • Services: Authentication Bastion Infrastructure Configuration Management Domain Name Service Monitoring and Alerting Provisioning Repositories
     • Team: Ben Parli, Danny McCarthy, David Chan, Guarav Jain, Luke James, Mani Hashemi, Trent Baker, Steve Zerbe
  6. Infrastructure Overview
     • Hybrid architecture: bare metal, private cloud, public cloud
     • ~16,000 compute nodes and growing
     • 350K Nagios objects: (hosts, contacts, services) x (groups), notifications, commands
     • Legacy Nagios infrastructure struggled to accommodate growth
     • Sensu’s design accommodates high-growth environments
     • Value: 10x it!
  7. Nagios: Legacy monitoring (Architecture)
     • Single Nagios master, multiple Nagios slaves per datacenter, Nagios Core 3.5.1
     • Single viewable pane via Thruk 1.58
     • Nagios slaves executed active checks
     • Nagios master received passive checks via NSCA
     • Nagios master acted as failover for Nagios slaves
     • Nagios architecture was tightly coupled with Puppet through exported resources
     • Nagios masters processed 350K objects
  8. Nagios: Legacy monitoring (Limitations)
     • Nagios masters and slaves were single points of failure
     • Nagios masters were not scalable horizontally
     • Tight coupling with Puppet using exported resources caused environment convergence delays
     • Adding or removing hosts, hostgroups, and checks was complicated and took hours
     • Routine maintenance caused alert storms and loss of telemetry
     • Simple changes caused configuration errors
  9. Nagios: Legacy monitoring (Work-arounds and attempted solutions)
     • Tried to upgrade Nagios and Thruk
     • Split Nagios cluster into server and network
     • Tried to implement other Nagios distributed solutions: Mod-Gearman
     • Offloaded metrics checks to time series monitoring service
     • Realized there was no path forward with Nagios
  10. Sensu: Next Generation Monitoring (Path Forward)
     • Needed to green-field the next-generation monitoring solution
     • Evaluated 6 to 8 monitoring solutions
     • Sensu met a long list of ~75 requirements
     • Monitor entire hybrid environment
     • No single points of failure
     • Easily deployed, scaled, and maintained
     • Integrate with PagerDuty, Slack, Email, Jira, Puppet, LDAP, SAML
     • Use existing Nagios plugins
     • Proof of Concept validated Sensu selection
  11. Sensu: Next Generation Monitoring (Architecture)
  12. Sensu: Next Generation Monitoring (Implementation)
     • Greenfield needed some work
     • Moved custom Redis and RabbitMQ modules to box_redis and box_rabbitmq
     • Upgraded Puppet modules: stdlib, apt
     • Upgraded Erlang and OpenJDK
     • Deployed standard Forge modules for Sensu, RabbitMQ, and Redis
     • Created Puppet node definition, roles, and profiles
     • Established easy-to-use frameworks with Puppet profiles and Hiera
  13. Sensu: Next Generation Monitoring (Implementation: Puppetfile and node definition). Figure labels: Puppetfile; Puppet Nodes Definition
  14. Sensu: Next Generation Monitoring (Implementation: roles and profiles). Figure labels: Roles; Puppet Profiles
  15. Sensu: Next Generation Monitoring (Implementation: hiera and contacts.pp). Figure labels: Hiera common.json; Profile Contacts.pp
  16. Nagios to Sensu Migration (Nagios to Sensu mapping)
     • nagios::client::add_to_hostgroup → sensu::subscription
     • nagios::magic::add_to_hostgroup → sensu::subscription
     • Nagios::object::{service|hostgroup|command} → sensu::check
  17. Nagios to Sensu Migration (Hostgroup to check mapping)
  18. Nagios to Sensu Migration (Hostgroup to aggregate check mapping)
  19. Nagios to Sensu Migration (Hostgroup to aggregate check mapping, continued)
  20. Nagios to Sensu Migration (Sensu aggregate check API configuration)
  21. Nagios to Sensu Migration (Sensu aggregate subscription)
  22. The end results (Positioned for growth)
     • 16,000+ hosts registered
     • ~1,250 checks migrated
     • Deployment improvements
     • Scaling improvements
     • Availability improvements
     • Administration improvements
     • Productivity increases
  23. What’s Next? (Infrastructure SRE future projects)
     • Network Nagios to Sensu migration
     • Wavefront integration
     • Filter utilization
     • Single sign-on implementation
     • Auto-remediation: StackStorm
     • Sensu 2.0 evaluation
  24. MAKE MOM PROUD.
  25. Thank You! Questions? Trent Baker, tbaker@box.com
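
The Nagios to Sensu mapping on slide 16 can be read as a plain lookup table from legacy Puppet resource types to their Sensu counterparts. The Python sketch below is only an illustration of that table, not the tooling Box used: the names `NAGIOS_TO_SENSU`, `sensu_equivalent`, and `summarize` are hypothetical, the `{service|hostgroup|command}` pattern is expanded into three separate keys, and lowercasing the resource name before lookup is an assumption made to tolerate the mixed capitalization on the slide.

```python
from collections import Counter

# Mapping as listed on slide 16: legacy Nagios Puppet resource -> Sensu Puppet resource.
# The {service|hostgroup|command} pattern from the slide is expanded into three keys.
NAGIOS_TO_SENSU = {
    "nagios::client::add_to_hostgroup": "sensu::subscription",
    "nagios::magic::add_to_hostgroup": "sensu::subscription",
    "nagios::object::service": "sensu::check",
    "nagios::object::hostgroup": "sensu::check",
    "nagios::object::command": "sensu::check",
}


def sensu_equivalent(nagios_resource: str) -> str:
    """Return the Sensu resource type for a Nagios resource type (hypothetical helper)."""
    # Assumption: names are compared case-insensitively.
    key = nagios_resource.strip().lower()
    if key not in NAGIOS_TO_SENSU:
        raise ValueError(f"no Sensu mapping known for {nagios_resource!r}")
    return NAGIOS_TO_SENSU[key]


def summarize(resources):
    """Count how many of the given Nagios resources map to each Sensu type."""
    return dict(Counter(sensu_equivalent(r) for r in resources))
```

A table like this makes the shape of the migration visible: hostgroup membership becomes a subscription, while service, hostgroup, and command objects all collapse into check definitions.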
