Nagios in the Agile / DevOps / Continuous Deployment World
Kishore Jalleda
Director of Operations
IMVU, Inc
kjalleda@imvu.com
             2012   2
About IMVU


 ► Avatar-based social entertainment destination
 ► $50+ Million Annual Revenue
 ► 100+ Million Registered Users
 ► 10+ Million Items in Virtual Catalog




IMVU Engineering and Continuous Deployment


 ►Doing the Impossible 50 times a day
 ►Continuous deployment (CD) is real
 ►IMVU has been one of the pioneers of CD
 ►DevOps culture is big
 ►No approval needed to ship to 1% of customers


 Check out our engineering blog
  http://engineering.imvu.com/

What does this mean ?


 ►Things change quickly
 ►New features add up instantly
 ►Can break frequently
 ►Failures can cascade rapidly
 ►Things can fall through the cracks
 ►Many things change at the same time
 ►Etc



Insights into Nagios @IMVU
Overview


 ►Nagios Core 3.2.0
 ►800+ Hosts
 ►18000+ Service Checks
 ►Single Nagios Instance
 ►8 cores, 8GB RAM




Server Lifecycle Management




 Purchase & Asset Management → DHCP, DNS → Preseed, CFEngine → Opspush → Nagios, Cacti, Istatd → CFEngine → Production → Decommission
[ Operations ] Continuous Integration and Deployment
IMVU Asset Database ( AssetDB )


►Built internally by IMVU
►Simple but powerful concept
►Source of truth for everything asset related
►Has information on
  ►Class ( mysql, standard-http-server, redis )
  ►Role ( customer shard, clientdynweb )
  ►Tag (available, no-update )
  ►Attributes (cpu-cores, memory-size, mysql-role )
  ►Much more …

Auto generation of Nagios configuration files


#generate_nagios_conf.pl
( most configurations auto generated from AssetDB )
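
A minimal sketch of the idea, in Python rather than the Perl of generate_nagios_conf.pl. The record fields, host names, addresses, and the generic-host template are assumptions made for illustration; the slides do not show the AssetDB schema or the script's output.

```python
# Hypothetical AssetDB records; the real schema is not shown in the slides.
ASSETS = [
    {"name": "db01", "address": "10.0.0.11", "class": "mysql"},
    {"name": "web01", "address": "10.0.0.21", "class": "standard-http-server"},
    {"name": "web02", "address": "10.0.0.22", "class": "standard-http-server"},
]


def render_host(asset):
    """Render one Nagios host object definition.

    Assumes a 'generic-host' template supplies the remaining directives.
    """
    return (
        "define host {\n"
        "    use        generic-host\n"
        f"    host_name  {asset['name']}\n"
        f"    address    {asset['address']}\n"
        "}\n"
    )


def render_hostgroups(assets):
    """Group hosts by AssetDB class and render one hostgroup per class."""
    members = {}
    for asset in assets:
        members.setdefault(asset["class"], []).append(asset["name"])
    blocks = []
    for group in sorted(members):
        blocks.append(
            "define hostgroup {\n"
            f"    hostgroup_name  {group}\n"
            f"    alias           {group}\n"
            f"    members         {','.join(sorted(members[group]))}\n"
            "}\n"
        )
    return "\n".join(blocks)


# Text that would be written to hosts.cfg and hostgroups.cfg.
hosts_cfg = "\n".join(render_host(a) for a in ASSETS)
hostgroups_cfg = render_hostgroups(ASSETS)
```

Generating both files from one source means a host added to the asset records shows up in monitoring without hand-editing configuration.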




Ops Buildbot ( builds, builders/buildslaves )

# svn commit hosts.cfg hostgroups.cfg




Opspush ( Operations Push System )


# opspush --comment "xxxxxx" --role nagios

 opspush → check status of "last build"
   ► green: run "cfagent -v" on the box
   ► red: --oncall-override ?
       ► yes: run "cfagent -v" on the box
       ► No: exit

 Flag shown on the diagram: --use-last-green-rev
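
The gate in the diagram can be sketched as a small decision function. This is an illustration only: the function name and arguments are hypothetical, and the actual opspush implementation is not shown in the slides.

```python
def should_push(last_build_status, oncall_override=False):
    """Return True when the push may go on to run cfagent on the box.

    last_build_status is "green" or "red", as in the diagram.
    """
    if last_build_status == "green":
        return True
    # Red build: only an explicit on-call override lets the push continue.
    return bool(oncall_override)
```

A red build with no override returns False, which corresponds to the "exit" branch.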
Product Development




 Ideation, UI Design, Usability Testing, etc → Tech Design → Monitoring and Alerting Coverage (Nagios) → Production → Maintenance
Tech Designs & New Nagios Alert Requests




Nagios Alert Request Template




Big Data / De-Sharding


 ► Data freshness is critical to help make the right
   business decisions
 ► Nagios used for ETL/DW status and error
   checking
 ► Nagios and Ops embeds can help empower
   your Data Infrastructure team




Things will FAIL




How we try to prevent and catch failures




 Local Acceptance Tests → Hypo Builds → Buildbot → Automated Cluster Immunity (CI) → Manual QA using roll-out → Nagios → 3rd party like webmetrics, customers, etc
Cluster Immune System

 Automated push monitoring and rollback !
 Push to X% of servers → Monitor Critical Metrics
   ► Good: Push to rest → Monitor Critical Metrics
       ► Good: w00t!, my change is Live
       ► Bad: Auto Rollback
   ► Bad: Auto Rollback
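
A rough sketch of the push, monitor, and rollback loop above. The deploy, rollback, and metrics_ok callables are hypothetical stand-ins supplied by the caller; the slides do not describe how IMVU's system implements these steps.

```python
def immune_push(servers, canary_fraction, deploy, rollback, metrics_ok):
    """Push to a fraction of servers, check metrics, then push to the rest.

    Returns "live" if the change stays deployed, "rolled_back" otherwise.
    """
    canary_count = max(1, int(len(servers) * canary_fraction))
    canary, rest = servers[:canary_count], servers[canary_count:]

    deployed = []
    for stage in (canary, rest):
        for server in stage:
            deploy(server)
            deployed.append(server)
        if not metrics_ok():
            # Bad metrics at either stage: undo everything pushed so far.
            for server in reversed(deployed):
                rollback(server)
            return "rolled_back"
    return "live"


log = []
result = immune_push(
    ["web01", "web02", "web03", "web04"],
    0.25,
    deploy=lambda s: log.append(("deploy", s)),
    rollback=lambda s: log.append(("rollback", s)),
    metrics_ok=lambda: True,
)
```

With healthy metrics at both stages the change ends up on all four servers and result is "live".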
Don’t just rely on Standard Metrics




Demystifying P1s ( Priority 1 )


 P1: Priority 1 issue impacting live operations
 Phases
 ► Identification (Nagios )
 ► Communication and Declaration
 ► Resolution
 ► Postmortem / 5 Whys / Root Cause Analysis
 ► P1 follow up



5 Why / Postmortem (PM) / Root Cause Analysis


 ► 5 Why process
 ► Amazing culture of running blameless
   postmortems
 ► New Nagios checks are the most common
   action items.
 ► A lot of monitoring and alerting on business
   and application level metrics was originally the
   outcome of PMs



Example “5 Whys” Process




Monitor Business & Application Level Metrics




Monitor Response Times


 Load Average is a meaningless number
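
One way to alert on response time instead of load average is a check that times a request and maps the result to the standard Nagios plugin exit codes. The thresholds and the request callable below are assumptions for illustration, not IMVU's actual check.

```python
import time

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios plugin exit codes


def classify(elapsed_s, warn_s, crit_s):
    """Map a measured response time to a Nagios status code."""
    if elapsed_s >= crit_s:
        return CRITICAL
    if elapsed_s >= warn_s:
        return WARNING
    return OK


def check_response_time(request, warn_s=0.5, crit_s=2.0):
    """Time one call to `request` and return (status, message)."""
    start = time.monotonic()
    try:
        request()
    except Exception as exc:
        # A request that fails outright is treated as CRITICAL in this sketch.
        return CRITICAL, f"CRITICAL - request failed: {exc}"
    elapsed = time.monotonic() - start
    status = classify(elapsed, warn_s, crit_s)
    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}[status]
    # Text after '|' is Nagios performance data.
    return status, f"{label} - response time {elapsed:.3f}s | time={elapsed:.3f}s"
```

A real plugin would print the message and exit with the status code; `request` could wrap an HTTP call to the page being measured.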




Continuous Monitoring ( Istatd )


 ► Developed by IMVU
 ► Sub 10 sec resolution of data
 ► API to get average, SD, min, max, and sample count
   for each data point in a graph
 ► Ability to stack multiple graphs on the fly
 ► Long retention times
 ► Releasing as open source this week !!!
 https://github.com/imvu-open/istatd/wiki
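
To make the per-data-point statistics concrete, here is a small sketch that groups raw samples into 10-second buckets and computes the same five values. It is not Istatd's API or storage format; the function name and sample layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def summarize(samples, bucket_s=10):
    """Group (timestamp, value) samples into fixed-width buckets.

    Returns {bucket_start: {"avg", "sd", "min", "max", "count"}}.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        buckets[int(timestamp // bucket_s) * bucket_s].append(value)
    return {
        start: {
            "avg": mean(values),
            "sd": pstdev(values),  # population standard deviation
            "min": min(values),
            "max": max(values),
            "count": len(values),
        }
        for start, values in sorted(buckets.items())
    }


stats = summarize([(0, 1.0), (3, 3.0), (12, 5.0)])
```

Here the first two samples fall in the bucket starting at 0 and the third in the bucket starting at 10.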

Istatd: 10 Second Resolution of Data




Istatd: Stacking graphs on the fly




Have a “Strategy” for Monitoring and Alerting
Our (Nagios) Strategy


 ► Human element of Monitoring and Alerting (
   Nagios )
 ► Nagios & Test Driven Development ( TDD )
 ► Decouple ( Nagios )
 ► Aggregated Checks




Human Element of Monitoring and Alerting


 ► Have zero tolerance towards False Positives.
   You do not want your ops staff to walk into the
   office next AM looking like zombies ;)
 ► Do not let people develop immunity to pages as
   very soon real issues will be ignored
 ► “All pages are actionable” policy: if a page requires no
   action, it should not page
 ► Automatically re-enable alerting/notifications that
   were improperly silenced
 ► Ownership and accountability of issues/alerts
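
Automatic re-enabling of silenced notifications could look roughly like the sketch below. The record fields and the rule for "improperly silenced" (notifications off for longer than a fixed window, with no scheduled downtime) are assumptions; the slides do not define the policy.

```python
def find_stale_silences(services, now, max_silence_s=4 * 3600):
    """Return names of services whose notifications should be re-enabled.

    Each service is a dict with hypothetical fields: name,
    notifications_enabled, in_downtime, silenced_at (epoch seconds).
    """
    names = []
    for svc in services:
        if svc["notifications_enabled"] or svc["in_downtime"]:
            continue  # nothing to fix, or silence is covered by downtime
        if now - svc["silenced_at"] > max_silence_s:
            names.append(svc["name"])
    return names
```

A periodic job could feed this list to whatever mechanism turns notifications back on.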
Daily Triage of Nagios Alerts and Interrupts




Nagios & Test Driven Development (TDD)


 ► Write tests for your Nagios Infrastructure
 ► Adopted heavily by Ops ( important to keep pace
   with engineering; DevOps culture is awesome )
 ► High degree of confidence in pushing changes
 ► Things will eventually change ( OS, libraries,
   logic, people, Nagios version, etc ). Tests will
   make the change much smoother.
 ► Functional testing can still be a challenge
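
As an illustration of testing monitoring code, the sketch below unit-tests the threshold logic of a check. The check_disk function and its thresholds are hypothetical; the slides do not show IMVU's actual tests.

```python
import unittest


def check_disk(used_pct, warn=80, crit=90):
    """Hypothetical check under test: returns a Nagios exit code."""
    if used_pct >= crit:
        return 2  # CRITICAL
    if used_pct >= warn:
        return 1  # WARNING
    return 0  # OK


class CheckDiskTest(unittest.TestCase):
    def test_ok_below_warning(self):
        self.assertEqual(check_disk(50), 0)

    def test_warning_at_threshold(self):
        self.assertEqual(check_disk(80), 1)

    def test_critical_at_threshold(self):
        self.assertEqual(check_disk(90), 2)
```

Saved as a module, the tests run with `python -m unittest`, so a threshold or library change that breaks the check is caught before it is pushed.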


Sample Nagios Test Output




Decouple Nagios

 We do it using “Fact, Worker, Reporter & Aggregator” Model


 ► Worker and Redis exchange: fact
 ► Reporter and Redis exchange: fact, fact status
 ► Aggregator and Redis exchange: fact status
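
One possible reading of the model, sketched with a plain dict standing in for Redis. The key names, the threshold table, and the direction of each data flow are assumptions; the slides show only the component names and the labels "fact" and "fact status".

```python
store = {}  # stands in for Redis in this sketch

THRESHOLDS = {"disk_used_pct": 90}  # thresholds kept in one central place


def worker(host, metric, value):
    """Worker: publish a raw measurement as a fact."""
    store[f"fact:{host}:{metric}"] = value


def reporter(host, metric):
    """Reporter: turn a fact into a fact status (OK or CRITICAL)."""
    value = store[f"fact:{host}:{metric}"]
    status = "CRITICAL" if value >= THRESHOLDS[metric] else "OK"
    store[f"status:{host}:{metric}"] = status
    return status


def aggregator(hosts, metric):
    """Aggregator: summarize fact statuses across hosts."""
    statuses = [store[f"status:{h}:{metric}"] for h in hosts]
    return {"critical": statuses.count("CRITICAL"), "total": len(statuses)}


worker("web01", "disk_used_pct", 95)
worker("web02", "disk_used_pct", 40)
for host in ("web01", "web02"):
    reporter(host, "disk_used_pct")
summary = aggregator(["web01", "web02"], "disk_used_pct")
```

Because each role only reads and writes the shared store, the measurement, the threshold evaluation, and the summary can be changed or scaled independently.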
Why Decouple ?


 ► For scalability and efficiency
 ► Our model performed better than NRPE
 ► Lets you change things ( like thresholds ) in one place
   instead of on 1000 or so machines ( as with NRPE )
 ► Lets you do aggregated checks, a simple but powerful
   concept that greatly reduces paging levels
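
An aggregated check can be sketched as a single decision over the statuses of every host in a group, so one failing host does not page by itself. The fractions below are assumed values for illustration, not IMVU's thresholds.

```python
def aggregated_status(statuses, warn_fraction=0.2, crit_fraction=0.5):
    """Collapse per-host statuses into one status for the whole group."""
    failing = sum(1 for s in statuses if s != "OK")
    fraction = failing / len(statuses) if statuses else 0.0
    if fraction >= crit_fraction:
        return "CRITICAL"
    if fraction >= warn_fraction:
        return "WARNING"
    return "OK"
```

With ten hosts, one failure stays OK, three failures raise WARNING, and six raise CRITICAL, which yields at most one alert for the group instead of one per host.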


Closing Remarks


 ► Monitoring and Alerting (M&A) is mission critical for
   any business, invest properly and smartly in it
 ► Don’t limit the usage of Nagios to just Ops. The secret
   to widespread adoption is to make things frictionless
 ► Bathroom breaks can take 5-10 minutes, so don’t fret
   too much about Nagios performance
 ► Build some form of predictive monitoring and alerting
   to catch and alert on change in trends
 ► Invest in configuration automation, validation and
   compliance
 ► Finally, Nagios has been like a Honda, very reliable !!!
Questions ?
Thank You !!!




                kjalleda@imvu.com
         We are Hiring: imvu.com/jobs
 Engineering Blog: http://engineering.imvu.com/


 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 

Nagios Conference 2012 - Kishore Jalleda - Nagios in the Agile DevOps Continuous Deployment World

  • 1. Nagios in the Agile / DevOps / Continuous Deployment World. Kishore Jalleda, Director of Operations, IMVU, Inc. kjalleda@imvu.com
  • 2. About IMVU 2012 2
  • 3. About IMVU: Avatar based Social Entertainment destination; $50+ Million Annual Revenue; 100+ Million Registered Users; 10+ Million Items in Virtual Catalog
  • 4. IMVU Engineering and Continuous Deployment ► Doing the Impossible 50 times a day ► Continuous deployment (CD) is real ► IMVU has been one of the pioneers of CD ► DevOps culture is big ► No approval needed to ship to 1% of customers. Check out our engineering blog: http://engineering.imvu.com/
  • 5. What does this mean? ► Things change quickly ► New features add up instantly ► Can break frequently ► Failures can cascade rapidly ► Things can fall through the cracks ► Many things change at the same time ► Etc
  • 7. Overview ► Nagios Core 3.2.0 ► 800+ Hosts ► 18000+ Service Checks ► Single Nagios Instance ► 8 cores, 8GB RAM
  • 8. Server Lifecycle Management: Purchase & Asset Management → DHCP, DNS → Preseed, CFEngine → Opspush → Nagios, Cacti, Istatd → CFEngine → Production → Decommission
  • 9. [ Operations ] Continuous Integration and Deployment
  • 10. IMVU Asset Database (AssetDB) ► Built internally by IMVU ► Simple but powerful concept ► Source of truth for everything asset related ► Has information on ► Class (mysql, standard-http-server, redis) ► Role (customer shard, clientdynweb) ► Tag (available, no-update) ► Attributes (cpu-cores, memory-size, mysql-role) ► Much more …
  • 11. Auto generation of Nagios configuration files: # generate_nagios_conf.pl (most configurations auto generated from AssetDB)
  • 12. Ops Buildbot (builds, builders/buildslaves): # svn commit hosts.cfg hostgroups.cfg
  • 13. Opspush (Operations Push System): # opspush --comment “xxxxxx” --role nagios --use-last-green-rev. Flowchart: check status of “last build”; green → opspush → run “cfagent -v” on the box; red → --oncall-override? yes → opspush, no → exit
  • 14. Product Development (diagram stages): Ideation, UI Design, Usability Testing, etc; Monitoring and Alerting Coverage.. (Nagios); Tech Design; Production; Maintenance
  • 15. Tech Designs & New Nagios Alert Requests
  • 16. Nagios Alert Request Template
  • 17. Big Data / De-Sharding ► Data freshness is critical to help make the right business decisions ► Nagios used for ETL/DW status and error checking ► Nagios and Ops embeds can help empower your Data Infrastructure team
  • 19. How we try to prevent and catch failures (diagram stages): Local Hypo Builds; Automated Acceptance Tests; Buildbot (CI); Manual QA using roll-out; Nagios; Cluster Immunity; 3rd party like webmetrics, customers, etc
  • 20. Cluster Immune System: automated push monitoring and rollback! Flowchart: push to X% of servers → monitor critical metrics → good: push to rest → monitor critical metrics → good: w00t!, my change is live; bad at either stage → auto rollback
  • 21. Don’t just rely on Standard Metrics
  • 22. Demystifying P1s (Priority 1). P1: Priority 1 issue impacting live operations. Phases ► Identification (Nagios) ► Communication and Declaration ► Resolution ► Postmortem / 5 Whys / Root Cause Analysis ► P1 follow up
  • 23. 5 Whys / Postmortem (PM) / Root Cause Analysis ► 5 Whys process ► Amazing culture of running blameless postmortems ► New Nagios checks are the most common action items ► A lot of monitoring and alerting on business and application level metrics was originally the outcome of PMs
  • 24. Example “5 Whys” Process
  • 25. Monitor Business & Application Level Metrics
  • 26. Monitor Response Times: Load Average is a meaningless number
  • 27. Continuous Monitoring (Istatd) ► Developed by IMVU ► Sub 10 sec resolution of data ► API to get average, SD, min, max, and sample count for each data point in a graph ► Ability to stack multiple graphs on the fly ► Long retention times ► Releasing as open source this week!!! https://github.com/imvu-open/istatd/wiki
  • 28. Istatd: 10 Second Resolution of Data
  • 29. Istatd: Stacking graphs on the fly
  • 30. Have a “Strategy” for Monitoring and Alerting
  • 31. Our (Nagios) Strategy ► Human element of Monitoring and Alerting (Nagios) ► Nagios & Test Driven Development (TDD) ► Decouple (Nagios) ► Aggregated Checks
  • 32. Human Element of Monitoring and Alerting ► Have zero tolerance towards False Positives. You do not want your ops staff to walk into the office the next morning looking like zombies ;) ► Do not let people develop immunity to pages, as very soon real issues will be ignored ► “All pages are actionable” policy: if there is no action to take, it should not be paging ► Automatic re-enabling of alerting/notifications that were improperly silenced ► Ownership and accountability of issues/alerts
  • 33. Daily Triage of Nagios Alerts and Interrupts
  • 34. Nagios & Test Driven Development (TDD) ► Write tests for your Nagios Infrastructure ► Adopted heavily by Ops (important to keep pace with engineering; DevOps culture is awesome) ► High degree of confidence in pushing changes ► Things will eventually change (OS, libraries, logic, people, Nagios version, etc). Tests will make the change much smoother ► Functional testing can still be a challenge
  • 35. Sample Nagios Test Output
  • 36. Decouple Nagios: we do it using the “Fact, Worker, Reporter & Aggregator” model (diagram components: Worker, Redis, Reporter, Aggregator, exchanging facts and statuses)
  • 37. Why Decouple? ► For scalability and efficiency ► Our model was higher performing compared to NRPE ► Lets you make changes (like thresholds) in one place instead of on something like 1000 machines (if using NRPE) ► Lets you do aggregated checks, which is again a very simple but powerful concept to reduce paging levels by a ton
  • 39. Closing Remarks ► Monitoring and Alerting (M&A) is mission critical for any business; invest properly and smartly in it ► Don’t limit the usage of Nagios to just Ops. The secret to widespread adoption is to make things frictionless ► Bathroom breaks can take 5-10 minutes, so don’t fret too much about Nagios performance ► Build some form of predictive monitoring and alerting to catch and alert on changes in trends ► Invest in configuration automation, validation and compliance ► Finally, Nagios has been like a Honda, very reliable!!!
  • 41. Thank You!!! kjalleda@imvu.com. We are Hiring: imvu.com/jobs. Engineering Blog: http://engineering.imvu.com/
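
The “Fact, Worker, Reporter & Aggregator” model and the aggregated checks from slides 36 and 37 can be illustrated with a minimal sketch. The slides do not show IMVU's implementation, so everything here is an assumption made for illustration: the `Fact`, `FactStore`, `report`, and `aggregate` names are hypothetical, an in-memory dictionary stands in for Redis, and the data flow (workers publish facts, a reporter applies one central threshold, an aggregator decides whether to page) is one plausible reading of the diagram. Only the exit codes follow the standard Nagios plugin convention.

```python
from dataclasses import dataclass

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


@dataclass
class Fact:
    """One raw measurement published by a worker (hypothetical shape)."""
    host: str
    metric: str
    value: float


class FactStore:
    """In-memory stand-in for the Redis store named on slide 36."""

    def __init__(self):
        self._facts = {}

    def put(self, fact):
        # Workers only publish facts; they know nothing about thresholds.
        self._facts[(fact.host, fact.metric)] = fact.value

    def values(self, metric):
        return {h: v for (h, m), v in self._facts.items() if m == metric}


def report(store, metric, threshold):
    """Reporter: turn facts into per-host statuses using one central threshold."""
    return {
        host: CRITICAL if value > threshold else OK
        for host, value in store.values(metric).items()
    }


def aggregate(statuses, max_bad_fraction):
    """Aggregator: alert only when too large a share of hosts is unhealthy."""
    if not statuses:
        return UNKNOWN, "no facts reported"
    bad = sorted(h for h, s in statuses.items() if s != OK)
    fraction = len(bad) / len(statuses)
    code = CRITICAL if fraction > max_bad_fraction else OK
    return code, f"{len(bad)}/{len(statuses)} hosts over threshold"


if __name__ == "__main__":
    store = FactStore()
    for host, load in [("web1", 1.0), ("web2", 2.0), ("web3", 3.0), ("web4", 9.0)]:
        store.put(Fact(host, "load", load))
    code, message = aggregate(report(store, "load", threshold=5.0), max_bad_fraction=0.25)
    print(code, message)
```

In this sketch the threshold lives only in the `report` call, which mirrors the slide's point about changing thresholds in one place, and a single unhealthy host out of four does not page because the aggregator looks at the share of bad hosts rather than at each host individually.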