Nagios Conference 2012 - Kishore Jalleda - Nagios in the Agile DevOps Continuous Deployment World

Kishore Jalleda's presentation on using Nagios in a continuous deployment environment.
The presentation was given during the Nagios World Conference North America held Sept 25-28th, 2012 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/nwcna


  1. Nagios in the Agile / DevOps / Continuous Deployment World
     Kishore Jalleda, Director of Operations, IMVU, Inc
     kjalleda@imvu.com
  2. About IMVU
  3. About IMVU
     ► Avatar-based social entertainment destination
     ► $50+ million annual revenue
     ► 100+ million registered users
     ► 10+ million items in virtual catalog
  4. IMVU Engineering and Continuous Deployment
     ► Doing the Impossible 50 times a day
     ► Continuous deployment (CD) is real
     ► IMVU has been one of the pioneers of CD
     ► DevOps culture is big
     ► No approval needed to ship to 1% of customers
     Check out our engineering blog: http://engineering.imvu.com/
  5. What does this mean ?
     ► Things change quickly
     ► New features add up instantly
     ► Can break frequently
     ► Failures can cascade rapidly
     ► Things can fall through the cracks
     ► Many things change at the same time
     ► Etc.
  6. Insights into Nagios @IMVU
  7. Overview
     ► Nagios Core 3.2.0
     ► 800+ hosts
     ► 18000+ service checks
     ► Single Nagios instance
     ► 8 cores, 8GB RAM
  8. Server Lifecycle Management
     [Diagram: server lifecycle stages, including purchase and asset management, DHCP/DNS, preseed, CFEngine, Opspush, Nagios/Cacti/Istatd, production, and decommission]
  9. [ Operations ] Continuous Integration and Deployment
  10. IMVU Asset Database ( AssetDB )
      ► Built internally by IMVU
      ► Simple but powerful concept
      ► Source of truth for everything asset related
      ► Has information on
        ► Class ( mysql, standard-http-server, redis )
        ► Role ( customer shard, clientdynweb )
        ► Tag ( available, no-update )
        ► Attributes ( cpu-cores, memory-size, mysql-role )
        ► Much more …
  11. Auto generation of Nagios configuration files
      # generate_nagios_conf.pl
      ( most configurations auto generated from AssetDB )
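The slides do not show the AssetDB schema or the contents of generate_nagios_conf.pl, so the following is only a minimal Python sketch of the general idea: render Nagios host and hostgroup object definitions from asset records. The record fields (`name`, `address`, `class`), the sample hosts, and the `generic-host` template name are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical asset records; the real AssetDB schema is not shown in the talk.
ASSETS = [
    {"name": "db01", "address": "10.0.0.11", "class": "mysql"},
    {"name": "db02", "address": "10.0.0.12", "class": "mysql"},
    {"name": "web01", "address": "10.0.0.21", "class": "standard-http-server"},
]


def render_host(asset):
    """Render one Nagios host object definition for an asset record."""
    return "\n".join([
        "define host {",
        "    use        generic-host",  # assumed template name
        f"    host_name  {asset['name']}",
        f"    address    {asset['address']}",
        "}",
    ])


def render_hostgroups(assets):
    """Render one hostgroup per asset class, listing its member hosts."""
    members = defaultdict(list)
    for asset in assets:
        members[asset["class"]].append(asset["name"])
    blocks = []
    for group in sorted(members):
        blocks.append("\n".join([
            "define hostgroup {",
            f"    hostgroup_name  {group}",
            f"    members         {','.join(sorted(members[group]))}",
            "}",
        ]))
    return "\n\n".join(blocks)


if __name__ == "__main__":
    # Print what would go into hosts.cfg and hostgroups.cfg.
    print("\n\n".join(render_host(a) for a in ASSETS))
    print()
    print(render_hostgroups(ASSETS))
```

Generating both files from one source of truth means a host added to the asset records shows up in monitoring without anyone hand-editing Nagios configuration.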
  12. Ops Buildbot ( builds, builders/buildslaves )
      # svn commit hosts.cfg hostgroups.cfg
  13. Opspush ( Operations Push System )
      # opspush --comment "xxxxxx" --role nagios
      [Flowchart: opspush checks the status of the "last build". Green: run "cfagent -v" on the box. Red: continue only if --oncall-override is given, otherwise exit. The diagram also shows a --use-last-green-rev option.]
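The Opspush slide describes a gate on the status of the last build. A rough Python sketch of that gate follows, assuming a green build always allows the push and a red build allows it only with an on-call override. The function name, argument names, and return values are hypothetical and are not Opspush's actual interface.

```python
def push_decision(last_build_status, oncall_override=False):
    """Decide whether a push may proceed.

    last_build_status: "green" or "red" (assumed values).
    oncall_override:   True if the operator explicitly overrides a red build.
    Returns "push" or "exit".
    """
    if last_build_status == "green":
        return "push"
    if oncall_override:
        # A red build is pushed only when the on-call engineer insists.
        return "push"
    return "exit"


if __name__ == "__main__":
    for status, override in [("green", False), ("red", False), ("red", True)]:
        print(status, override, "->", push_decision(status, override))
```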
  14. Product Development
      [Diagram: product development stages: ideation, UI design, usability testing, etc.; tech design; monitoring and alerting coverage (Nagios); production; maintenance]
  15. Tech Designs & New Nagios Alert Requests
  16. Nagios Alert Request Template
  17. Big Data / De-Sharding
      ► Data freshness is critical to help make the right business decisions
      ► Nagios used for ETL/DW status and error checking
      ► Nagios and Ops embeds can help empower your Data Infrastructure team
  18. Things will FAIL
  19. How we try to prevent and catch failures
      [Diagram: stages that catch failures, with labels including Local, Automated Acceptance Tests, Hypo Builds, Buildbot (CI), Manual QA using roll-out, Nagios, Cluster Immunity, and 3rd party like webmetrics, customers, etc]
  20. Cluster Immune System
      Automated push monitoring and rollback !
      [Flowchart: push to X% of servers → monitor critical metrics → if good, push to rest → monitor critical metrics → if good, "w00t!, my change is Live"; if bad at either step, auto rollback]
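The staged push on the Cluster Immune System slide can be sketched in Python as below. This is an illustration under stated assumptions, not IMVU's implementation: `deploy`, `rollback`, and `metrics_ok` are hypothetical callbacks, and the slides do not say how the X% of servers is chosen or which metrics count as critical.

```python
def immune_push(servers, fraction, deploy, rollback, metrics_ok):
    """Push to a fraction of servers, check metrics, then push to the rest.

    deploy(list_of_servers) and rollback(list_of_servers) are hypothetical
    callbacks; metrics_ok() returns True while critical metrics look healthy.
    Returns "live" or "rolled back".
    """
    first_count = max(1, int(len(servers) * fraction))
    first, rest = servers[:first_count], servers[first_count:]

    deploy(first)
    if not metrics_ok():
        rollback(first)      # only the first batch received the change
        return "rolled back"

    deploy(rest)
    if not metrics_ok():
        rollback(servers)    # the change is everywhere, so revert everywhere
        return "rolled back"

    return "live"


if __name__ == "__main__":
    servers = [f"web{i}" for i in range(10)]
    result = immune_push(
        servers, 0.1,
        deploy=lambda batch: print("deploy", batch),
        rollback=lambda batch: print("rollback", batch),
        metrics_ok=lambda: True,
    )
    print(result)
```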
  21. Don’t just rely on Standard Metrics
  22. Demystifying P1s ( Priority 1 )
      P1: Priority 1 issue impacting live operations
      Phases
      ► Identification ( Nagios )
      ► Communication and Declaration
      ► Resolution
      ► Postmortem / 5 Whys / Root Cause Analysis
      ► P1 follow up
  23. 5 Why / Postmortem (PM) / Root Cause Analysis
      ► 5 Why process
      ► Amazing culture of running blameless postmortems
      ► New Nagios checks are the most common action items
      ► A lot of monitoring and alerting on business and application level metrics was originally the outcome of PMs
  24. Example “5 Whys” Process
  25. Monitor Business & Application Level Metrics
  26. Monitor Response Times
      Load Average is a meaningless number
  27. Continuous Monitoring ( Istatd )
      ► Developed by IMVU
      ► Sub 10 sec resolution of data
      ► API to get average, SD, min, max, and sample count for each data point in a graph
      ► Ability to stack multiple graphs on the fly
      ► Long retention times
      ► Releasing as open source this week !!! https://github.com/imvu-open/istatd/wiki
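To make the per-data-point statistics concrete, here is a small Python sketch that groups raw samples into 10-second buckets and computes the average, standard deviation, min, max, and sample count for each. It illustrates the concept only and is not istatd's API or storage format; the function name and the `(timestamp, value)` sample shape are assumptions.

```python
import statistics
from collections import defaultdict


def bucketize(samples, width=10):
    """Group (timestamp, value) samples into fixed-width time buckets.

    Returns a dict mapping each bucket start time to its summary statistics.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        buckets[timestamp - timestamp % width].append(value)

    summary = {}
    for start in sorted(buckets):
        values = buckets[start]
        summary[start] = {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "avg": statistics.mean(values),
            # Population standard deviation; defined even for one sample.
            "sd": statistics.pstdev(values),
        }
    return summary


if __name__ == "__main__":
    samples = [(0, 1.0), (3, 3.0), (12, 5.0)]
    for start, stats in bucketize(samples).items():
        print(start, stats)
```

Keeping the count and spread alongside the average lets a graph show whether a bucket's average came from one sample or many, and how noisy they were.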
  28. Istatd: 10 Second Resolution of Data
  29. Istatd: Stacking graphs on the fly
  30. Have a “Strategy” for Monitoring and Alerting
  31. Our (Nagios) Strategy
      ► Human element of Monitoring and Alerting ( Nagios )
      ► Nagios & Test Driven Development ( TDD )
      ► Decouple ( Nagios )
      ► Aggregated Checks
  32. Human Element of Monitoring and Alerting
      ► Have zero tolerance towards false positives. You do not want your ops staff to walk into the office the next morning looking like zombies ;)
      ► Do not let people develop immunity to pages, as very soon real issues will be ignored
      ► "All pages are actionable" policy: if there is no action, it should not be paging
      ► Automatic re-enabling of alerting/notifications that were improperly silenced
      ► Ownership and accountability of issues/alerts
  33. Daily Triage of Nagios Alerts and Interrupts
  34. Nagios & Test Driven Development (TDD)
      ► Write tests for your Nagios infrastructure
      ► Adopted heavily by Ops ( important to keep pace with eng; DevOps culture is awesome )
      ► High degree of confidence in pushing changes
      ► Things will eventually change ( OS, libraries, logic, people, Nagios version, etc ). Tests will make the change much smoother.
      ► Functional testing can still be a challenge
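The slides do not show IMVU's tests, so the following Python sketch only illustrates what a unit test for Nagios check logic could look like. The check itself (`check_queue_depth`) and its thresholds are hypothetical; the exit codes 0, 1, 2, and 3 for OK, WARNING, CRITICAL, and UNKNOWN follow the standard Nagios plugin convention.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_queue_depth(depth, warn=100, crit=500):
    """Hypothetical check: map a queue depth to (exit_code, message)."""
    if depth is None or depth < 0:
        return UNKNOWN, "UNKNOWN - no valid queue depth reading"
    if depth >= crit:
        return CRITICAL, f"CRITICAL - queue depth {depth} >= {crit}"
    if depth >= warn:
        return WARNING, f"WARNING - queue depth {depth} >= {warn}"
    return OK, f"OK - queue depth {depth}"


def test_ok_below_warning():
    assert check_queue_depth(10)[0] == OK


def test_warning_at_threshold():
    assert check_queue_depth(100)[0] == WARNING


def test_critical_at_threshold():
    assert check_queue_depth(500)[0] == CRITICAL


def test_unknown_on_missing_data():
    assert check_queue_depth(None)[0] == UNKNOWN


if __name__ == "__main__":
    for test in (test_ok_below_warning, test_warning_at_threshold,
                 test_critical_at_threshold, test_unknown_on_missing_data):
        test()
    print("all check tests passed")
```

Separating the threshold logic from the code that fetches the measurement keeps the logic testable without a live system, which is the part that tends to break silently when libraries or OS versions change.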
  35. Sample Nagios Test Output
  36. Decouple Nagios
      We do it using “Fact, Worker, Reporter & Aggregator” Model
      [Diagram: Worker, Redis, Reporter, and Aggregator boxes, connected by arrows labeled "fact" and "status"]
  37. Why Decouple ?
      ► For scalability and efficiency
      ► Our model was higher performing compared to NRPE
      ► Lets you make changes ( like thresholds ) in one place instead of on something like 1000 machines ( if using NRPE )
      ► Lets you do aggregated checks, which is again a very simple but powerful concept to reduce paging levels by a ton
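The slides name the Fact, Worker, Reporter and Aggregator roles but do not define them, so the Python sketch below is one plausible reading, with every detail an assumption: workers write raw measurements ("facts") to a shared store (a plain dict standing in for Redis), a reporter turns facts into statuses using thresholds kept in one place, and an aggregator alerts only when too large a share of hosts is unhealthy.

```python
# Thresholds live in one place instead of on every monitored machine.
THRESHOLDS = {"disk_used_pct": 90}  # hypothetical metric name and limit


def worker_report(store, host, metric, value):
    """Worker: publish a raw fact. `store` is a dict standing in for Redis."""
    store[(host, metric)] = value


def reporter(store, thresholds):
    """Reporter: convert each fact with a known threshold into a status."""
    statuses = {}
    for (host, metric), value in store.items():
        limit = thresholds.get(metric)
        if limit is not None:
            statuses[(host, metric)] = "CRITICAL" if value >= limit else "OK"
    return statuses


def aggregator(statuses, metric, max_bad_fraction):
    """Aggregator: one status for the whole pool of hosts for `metric`."""
    relevant = [s for (_, m), s in statuses.items() if m == metric]
    if not relevant:
        return "UNKNOWN"
    bad = sum(1 for s in relevant if s != "OK")
    return "CRITICAL" if bad / len(relevant) > max_bad_fraction else "OK"


if __name__ == "__main__":
    facts = {}
    for host, used in [("web1", 95), ("web2", 40), ("web3", 50), ("web4", 60)]:
        worker_report(facts, host, "disk_used_pct", used)
    statuses = reporter(facts, THRESHOLDS)
    # One bad host out of four does not page when the limit is 25%.
    print(aggregator(statuses, "disk_used_pct", max_bad_fraction=0.25))
```

In this reading, changing a threshold means editing `THRESHOLDS` once, and the aggregated check pages once for the pool rather than once per host.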
  38. Closing Remarks
  39. Closing Remarks
      ► Monitoring and Alerting (M&A) is mission critical for any business; invest properly and smartly in it
      ► Don’t limit the usage of Nagios to just Ops. The secret to widespread adoption is to make things frictionless
      ► Bathroom breaks can take 5-10 minutes, so don’t fret too much about Nagios performance
      ► Build some form of predictive monitoring and alerting to catch and alert on change in trends
      ► Invest in configuration automation, validation and compliance
      ► Finally, Nagios has been like a Honda, very reliable !!!
  40. Questions ?
  41. Thank You !!!
      kjalleda@imvu.com
      We are Hiring: imvu.com/jobs
      Engineering Blog: http://engineering.imvu.com/
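The talk does not say how trend changes should be detected. As one very simple illustration, the Python sketch below flags a new value that falls more than `k` standard deviations from the mean of recent history. The function name, the window contents, and the default `k` are assumptions, and real predictive monitoring would likely need something more robust (seasonality, slow drifts).

```python
import statistics


def deviates_from_trend(history, latest, k=3.0):
    """Return True if `latest` is more than k sample standard deviations
    away from the mean of `history` (a list of recent values)."""
    if len(history) < 2:
        return False  # not enough data to estimate a spread
    mean = statistics.mean(history)
    sd = statistics.stdev(history)
    if sd == 0:
        return latest != mean
    return abs(latest - mean) > k * sd


if __name__ == "__main__":
    recent = [100, 102, 98, 101, 99]
    print(deviates_from_trend(recent, 100))  # within the usual range
    print(deviates_from_trend(recent, 150))  # far outside the usual range
```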
