More than 41 million users and 74,000 businesses, including 59% of the Fortune 500, trust Box to manage content in the cloud. Box was monitoring this web-scale infrastructure with Nagios and could not keep up with the rapid pace of change inside the company. In this talk from Sensu Summit 2018, Trent Baker, Senior Infrastructure Site Reliability Engineer at Box, Inc., tells the story of their migration: from wrestling with the management of 350K objects in Nagios, including over 130K checks, to shutting down the last Nagios host roughly a year later.
The Box.com success story: migrating 350K Nagios objects to Sensu
1. The box.com success story: migrating 350K Nagios objects to Sensu
Trent Baker, Senior Infrastructure SRE
2. The box.com success story: migrating 350K Nagios objects to Sensu
Trent Baker
Senior Infrastructure SRE
Box
3. Agenda
/ What is Box?
/ Infrastructure Overview
/ Nagios: Legacy monitoring
/ Sensu: Next generation monitoring
/ Nagios to Sensu migration
/ The end results
/ What’s next?
/ Questions and Answers
4. What is Box?
/ Content Collaboration Platform leader
/ Redwood City headquarters, with offices in Europe, Asia, and Australia
/ Approximately 1,300 employees
/ Customers: 82,000 enterprise, 11M individual
/ Vision: Build amazing products that power how people work together
/ Value: Blow our customers' minds!
5. Infrastructure SRE
Mission: Design and build services in a hybrid infrastructure that are highly available, flexible, scalable, secure, and global.
Infrastructure SRE Services
/ Authentication
/ Bastion Infrastructure
/ Configuration Management
/ Domain Name Service
/ Monitoring and Alerting
/ Provisioning
/ Repositories
Infrastructure SREs
/ Ben Parli
/ Danny McCarthy
/ David Chan
/ Guarav Jain
/ Luke James
/ Mani Hashemi
/ Trent Baker
/ Steve Zerbe
6. Infrastructure Overview
/ Hybrid architecture: bare metal, private cloud, public cloud
/ ~16,000 compute nodes and growing
/ 350K Nagios objects: (hosts, contacts, services) x (groups), notifications, commands
/ Legacy Nagios infrastructure struggled to accommodate growth
/ Sensu’s design accommodates high-growth environments
/ Value: 10x it!
7. Nagios: Legacy monitoring
Architecture
/ Single Nagios master, multiple Nagios slaves per datacenter, Nagios Core 3.5.1
/ Single viewable pane via Thruk 1.58
/ Nagios slaves executed active checks
/ Nagios master received passive check results via NSCA
/ Nagios master acted as failover for Nagios slaves
/ Nagios architecture was tightly coupled with Puppet through exported resources
/ Nagios masters processed 350K objects
8. Nagios: Legacy monitoring
Limitations
/ Nagios masters and slaves were single points of failure
/ Nagios masters were not horizontally scalable
/ Tight coupling with Puppet using exported resources caused environment convergence delays
/ Adding or removing hosts, hostgroups, and checks was complicated and took hours
/ Routine maintenance caused alert storms and loss of telemetry
/ Simple changes caused configuration errors
9. Nagios: Legacy monitoring
Work-arounds and attempted solutions
/ Tried to upgrade Nagios and Thruk
/ Split Nagios cluster into server and network
/ Tried to implement other Nagios distributed solutions: Mod-Gearman
/ Offloaded metrics checks to time series monitoring service
/ Realized there was no path forward with Nagios
10. Sensu: Next Generation Monitoring
Path Forward
/ Needed to green-field the next-generation monitoring solution
/ Evaluated 6 to 8 monitoring solutions
/ Sensu met a long list of ~75 requirements
/ Monitor entire hybrid environment
/ No single points of failure
/ Easily deployed, scaled, and maintained
/ Integrate with PagerDuty, Slack, Email, Jira, Puppet, LDAP, SAML
/ Use existing Nagios plugins
/ Proof of Concept validated Sensu selection
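The "use existing Nagios plugins" requirement works because Sensu checks follow the same exit-status convention as Nagios plugins (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). A minimal sketch of that convention follows; the function name, the disk-usage example, and the thresholds are hypothetical and do not come from the talk.

```python
# Exit statuses shared by Nagios plugins and Sensu checks.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_disk_usage(percent_used, warn=80.0, crit=90.0):
    """Return (exit_status, message) for a disk-usage reading.

    The thresholds are illustrative defaults, not values from the talk.
    """
    # Anything that is not a percentage is reported as UNKNOWN, not OK.
    if percent_used is None or not 0 <= percent_used <= 100:
        return UNKNOWN, "DISK UNKNOWN - invalid reading"
    if percent_used >= crit:
        return CRITICAL, f"DISK CRITICAL - {percent_used:.0f}% used"
    if percent_used >= warn:
        return WARNING, f"DISK WARNING - {percent_used:.0f}% used"
    return OK, f"DISK OK - {percent_used:.0f}% used"
```

A real plugin would print the message and exit with the status, for example `sys.exit(status)`; either monitoring system then interprets the exit code the same way.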
11. Sensu: Next Generation Monitoring
Architecture
12. Sensu: Next Generation Monitoring
Implementation
/ Greenfield needed some work
/ Moved custom redis and rabbitmq modules to box_redis and box_rabbitmq
/ Upgraded Puppet modules: stdlib, apt
/ Upgraded Erlang and OpenJDK
/ Deployed standard Forge modules for Sensu, RabbitMQ, and Redis
/ Created Puppet node definition, roles, and profiles
/ Established easy-to-use frameworks with Puppet profiles and Hiera
13. Sensu: Next Generation Monitoring
Implementation: Puppetfile and node definition
Panels: Puppetfile, Puppet node definition
14. Sensu: Next Generation Monitoring
Implementation: roles and profiles
Panels: Roles, Puppet profiles
15. Sensu: Next Generation Monitoring
Implementation: hiera and contacts.pp
Panels: Hiera common.json, Profile contacts.pp
16. Nagios to Sensu Migration
Nagios to Sensu mapping
nagios::client::add_to_hostgroup → sensu::subscription
nagios::magic::add_to_hostgroup → sensu::subscription
nagios::object::{service|hostgroup|command} → sensu::check
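The mapping above can be read as a lookup table from legacy Nagios Puppet types to their Sensu replacements. The sketch below encodes only the pairs listed on the slide; the helper names `sensu_type_for` and `plan_migration` are hypothetical and are not part of Box's tooling.

```python
# Nagios Puppet type -> Sensu Puppet type, as listed on the slide.
NAGIOS_TO_SENSU = {
    "nagios::client::add_to_hostgroup": "sensu::subscription",
    "nagios::magic::add_to_hostgroup": "sensu::subscription",
    "nagios::object::service": "sensu::check",
    "nagios::object::hostgroup": "sensu::check",
    "nagios::object::command": "sensu::check",
}


def sensu_type_for(nagios_type):
    """Return the Sensu type that replaces a Nagios type (case-insensitive)."""
    key = nagios_type.strip().lower()
    try:
        return NAGIOS_TO_SENSU[key]
    except KeyError:
        raise ValueError(f"no Sensu mapping known for {nagios_type!r}") from None


def plan_migration(declared_types):
    """Group declared Nagios types by the Sensu type they migrate to."""
    plan = {}
    for nagios_type in declared_types:
        plan.setdefault(sensu_type_for(nagios_type), []).append(nagios_type)
    return plan
```

Raising on an unknown type, instead of skipping it, keeps unmapped resources visible during a migration.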
17. Nagios to Sensu Migration
Hostgroup to check mapping
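The slide content for this mapping did not survive extraction. As a rough illustration of the idea, the sketch below turns a Nagios-style service attached to hostgroups into a Sensu 1.x-style check definition whose `subscribers` are the former hostgroups. The input keys (including the pre-resolved `command_line`), the 60-second Nagios interval unit, and the function name are assumptions, not Box's actual code.

```python
import re


def nagios_service_to_sensu_check(service):
    """Build a Sensu-style check definition from a Nagios-style service dict.

    Assumed input keys (hypothetical): service_description, command_line
    (the fully resolved plugin command), hostgroup_name (comma-separated),
    and check_interval in Nagios time units, assumed to be 60 seconds each.
    """
    # Check names are restricted here to word characters, dots, and dashes.
    name = re.sub(r"[^\w.-]+", "_", service["service_description"].strip().lower())
    # Each Nagios hostgroup becomes a Sensu subscription name.
    subscribers = [g.strip() for g in service["hostgroup_name"].split(",") if g.strip()]
    return {
        "checks": {
            name: {
                "command": service["command_line"],
                "subscribers": subscribers,
                "interval": int(service["check_interval"]) * 60,
            }
        }
    }
```

Serializing the returned dictionary with `json.dumps` gives a JSON document in the general shape of a Sensu 1.x check configuration file.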
18. Nagios to Sensu Migration
Hostgroup to aggregate check mapping
19. Nagios to Sensu Migration
Hostgroup to aggregate check mapping (continued)
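The aggregate mapping itself was shown as an image and is not recoverable here. In Sensu 1.x, a check joins an aggregate through a named `aggregate` attribute, and per-client alerting is commonly disabled with `handle: false` so that only the aggregate result alerts. The sketch below shows that transformation on a check body; the function name and example values are hypothetical.

```python
def as_aggregate_check(check, aggregate_name):
    """Return a copy of a check body that reports into a named aggregate.

    Assumes the Sensu 1.x attributes `aggregate` (aggregate name) and
    `handle` (False suppresses per-client event handling).
    """
    if not aggregate_name:
        raise ValueError("aggregate_name must be a non-empty string")
    aggregated = dict(check)  # shallow copy; the caller's dict is untouched
    aggregated["aggregate"] = aggregate_name
    aggregated["handle"] = False
    return aggregated
```

For example, `as_aggregate_check({"command": "check_http", "subscribers": ["web"], "interval": 60}, "web_http")` leaves the original check fields in place and adds the two aggregate-related attributes.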
20. Nagios to Sensu Migration
Sensu aggregate check API configuration
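The configuration on this slide is not recoverable from the extraction. A common pattern is a check that reads an aggregate from the Sensu API and alerts when too large a share of results is not OK. The sketch below assumes a response shaped like the Sensu 1.x Aggregates API (a `results` object with `ok` and `total` counts); the thresholds and function name are hypothetical.

```python
def evaluate_aggregate(aggregate, warn_pct=50.0, crit_pct=75.0):
    """Return (exit_status, message) from an aggregate summary.

    `aggregate` is assumed to look like {"results": {"ok": N, "total": M, ...}}.
    Thresholds are percentages of non-OK results and are illustrative only.
    """
    results = aggregate["results"]
    total = results["total"]
    if total == 0:
        return 3, "AGGREGATE UNKNOWN - no results"
    non_ok_pct = 100.0 * (total - results["ok"]) / total
    if non_ok_pct >= crit_pct:
        return 2, f"AGGREGATE CRITICAL - {non_ok_pct:.0f}% non-OK"
    if non_ok_pct >= warn_pct:
        return 1, f"AGGREGATE WARNING - {non_ok_pct:.0f}% non-OK"
    return 0, f"AGGREGATE OK - {non_ok_pct:.0f}% non-OK"
```

Alerting on the percentage rather than on each client means a single failing host in a large pool does not page anyone, while a widespread failure does.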
21. Nagios to Sensu Migration
Sensu aggregate subscription
22. The end results
Positioned for growth
/ 16,000+ hosts registered
/ ~1,250 checks migrated
/ Deployment improvements
/ Scaling improvements
/ Availability improvements
/ Administration improvements
/ Productivity increases
23. What’s Next?
Infrastructure SRE future projects
/ Network Nagios to Sensu migration
/ Wavefront integration
/ Filter utilization
/ Single sign-on implementation
/ Auto-remediation: StackStorm
/ Sensu 2.0 evaluation