Nagios Conference 2012 - Kishore Jalleda - Nagios in the Agile DevOps Continuous Deployment World

Kishore Jalleda's presentation on using Nagios in a continuous deployment environment.
The presentation was given during the Nagios World Conference North America, held September 25-28, 2012 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/nwcna

Presentation Transcript

  • Nagios in the Agile / DevOps / Continuous Deployment World. Kishore Jalleda, Director of Operations, IMVU, Inc. kjalleda@imvu.com
  • About IMVU
  • About IMVU
    ► Avatar-based social entertainment destination
    ► $50+ million annual revenue
    ► 100+ million registered users
    ► 10+ million items in virtual catalog
  • IMVU Engineering and Continuous Deployment
    ► Doing the impossible 50 times a day
    ► Continuous deployment (CD) is real
    ► IMVU has been one of the pioneers of CD
    ► DevOps culture is big
    ► No approval needed to ship to 1% of customers
    Check out our engineering blog: http://engineering.imvu.com/
  • What does this mean?
    ► Things change quickly
    ► New features add up instantly
    ► Things can break frequently
    ► Failures can cascade rapidly
    ► Things can fall through the cracks
    ► Many things change at the same time
    ► Etc.
  • Insights into Nagios @IMVU
  • Overview
    ► Nagios Core 3.2.0
    ► 800+ hosts
    ► 18,000+ service checks
    ► Single Nagios instance
    ► 8 cores, 8 GB RAM
  • Server Lifecycle Management
    [Diagram of lifecycle stages. Labels recovered from the slide: Purchase & Asset Management; DHCP, DNS; Preseed, CFEngine; Opspush; Nagios, Cacti, Istatd; CFEngine; Production; Decommission]
  • [ Operations ] Continuous Integration and Deployment
  • IMVU Asset Database (AssetDB)
    ► Built internally by IMVU
    ► Simple but powerful concept
    ► Source of truth for everything asset related
    ► Has information on
      ► Class (mysql, standard-http-server, redis)
      ► Role (customer shard, clientdynweb)
      ► Tag (available, no-update)
      ► Attributes (cpu-cores, memory-size, mysql-role)
      ► Much more …
  • Auto generation of Nagios configuration files
    # generate_nagios_conf.pl
    (most configurations auto generated from AssetDB)
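The idea of generating Nagios configuration from an asset database can be sketched roughly as follows. This is not IMVU's generate_nagios_conf.pl (which the slides do not show); the `Asset` record, its field names, and the `generic-host` template are assumptions made for illustration. Only the `define host { ... }` object syntax and the `use`, `host_name`, `address`, and `hostgroups` directives come from Nagios itself.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    """Hypothetical AssetDB record; the field names are assumptions."""
    hostname: str
    address: str
    asset_class: str  # e.g. "mysql", "redis"


def render_host(asset: Asset) -> str:
    """Render one Nagios host object definition for an asset."""
    lines = [
        "define host {",
        "    use         generic-host",  # assumed template name
        f"    host_name   {asset.hostname}",
        f"    address     {asset.address}",
        f"    hostgroups  {asset.asset_class}",
        "}",
    ]
    return "\n".join(lines)


def render_hosts(assets) -> str:
    """Render all assets, sorted by hostname so diffs stay stable."""
    ordered = sorted(assets, key=lambda a: a.hostname)
    return "\n\n".join(render_host(a) for a in ordered) + "\n"
```

Sorting the output keeps the generated hosts.cfg deterministic, which makes each commit to version control easy to review.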
  • Ops Buildbot (builds, builders/buildslaves)
    # svn commit hosts.cfg hostgroups.cfg
  • Opspush (Operations Push System)
    # opspush --comment “xxxxxx” --role nagios
    [Flowchart: opspush -> check status of “last build”. Green -> run “cfagent -v” on the box. Red -> --oncall-override? Yes -> push proceeds; No -> exit. Option also shown: --use-last-green-rev]
  • Product Development
    [Diagram. Labels recovered from the slide: Ideation, UI Design, Tech Design, Usability Testing, etc; Monitoring and Alerting Coverage (Nagios); Production; Maintenance]
  • Tech Designs & New Nagios Alert Requests
  • Nagios Alert Request Template
  • Big Data / De-Sharding
    ► Data freshness is critical to help make the right business decisions
    ► Nagios used for ETL/DW status and error checking
    ► Nagios and Ops embeds can help empower your Data Infrastructure team
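A data-freshness check of the kind described above could look something like this minimal sketch. The function name, thresholds, and the idea of comparing against a "last load" timestamp are assumptions; the slides do not show IMVU's actual ETL checks. The return codes follow the standard Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_freshness(last_load_epoch, now, warn_secs, crit_secs):
    """Return (exit_code, message) based on how old the last ETL load is.

    last_load_epoch: time of the last successful load (seconds since the
    epoch), or None if no load was found. How this is looked up is left
    out; it depends on the warehouse.
    """
    if last_load_epoch is None:
        return UNKNOWN, "UNKNOWN - no load timestamp found"
    age = now - last_load_epoch
    if age >= crit_secs:
        return CRITICAL, f"CRITICAL - last ETL load was {age:.0f}s ago"
    if age >= warn_secs:
        return WARNING, f"WARNING - last ETL load was {age:.0f}s ago"
    return OK, f"OK - last ETL load was {age:.0f}s ago"
```

A wrapper script would print the message and exit with the returned code, which is how Nagios reads a plugin's state.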
  • Things will FAIL
  • How we try to prevent and catch failures
    [Diagram. Labels recovered from the slide: Local Builds; Automated Acceptance Tests; Buildbot (CI); Manual QA using Hypo roll-out; Cluster Immunity; Nagios; 3rd party like webmetrics, customers, etc]
  • Cluster Immune System: automated push monitoring and rollback!
    [Flowchart: Push to X% of servers -> Monitor critical metrics. Bad -> Auto rollback. Good -> Push to rest -> Monitor critical metrics. Bad -> Auto rollback. Good -> “w00t!, my change is live”]
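The push, monitor, and rollback loop in the flowchart can be sketched as below. This is only an illustration of the control flow, not IMVU's Cluster Immune System: the function name and the `deploy`, `rollback`, and `metrics_ok` callbacks are hypothetical stand-ins for whatever actually pushes code and samples critical metrics.

```python
def staged_push(servers, canary_fraction, deploy, rollback, metrics_ok):
    """Push to a fraction of servers, then to the rest, rolling back on bad metrics.

    Returns True if the change went live everywhere, False if it was rolled back.
    """
    if not servers:
        return True
    # Always push to at least one server in the first stage.
    n = max(1, int(len(servers) * canary_fraction))
    stages = (servers[:n], servers[n:])
    pushed = []
    for stage in stages:
        for server in stage:
            deploy(server)
            pushed.append(server)
        # Check critical metrics after each stage.
        if not metrics_ok():
            for server in reversed(pushed):
                rollback(server)
            return False
    return True
```

For example, `staged_push(hosts, 0.01, deploy, rollback, metrics_ok)` would try the change on roughly 1% of hosts before touching the rest.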
  • Don’t just rely on Standard Metrics
  • Demystifying P1s (Priority 1)
    P1: Priority 1 issue impacting live operations
    Phases
    ► Identification (Nagios)
    ► Communication and declaration
    ► Resolution
    ► Postmortem / 5 Whys / root cause analysis
    ► P1 follow-up
  • 5 Whys / Postmortem (PM) / Root Cause Analysis
    ► 5 Whys process
    ► Amazing culture of running blameless postmortems
    ► New Nagios checks are the most common action items
    ► A lot of our monitoring and alerting on business- and application-level metrics was originally the outcome of PMs
  • Example “5 Whys” Process
  • Monitor Business & Application Level Metrics
  • Monitor Response Times: Load Average is a meaningless number :)
  • Continuous Monitoring (Istatd)
    ► Developed by IMVU
    ► Sub-10-second resolution of data
    ► API to get average, SD, min, max, and sample count for each data point in a graph
    ► Ability to stack multiple graphs on the fly
    ► Long retention times
    ► Releasing as open source this week!!! https://github.com/imvu-open/istatd/wiki
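To make the per-data-point statistics concrete, here is a small sketch that groups raw samples into 10-second buckets and computes the same five values the slide lists. It is not istatd's code or API; the function name and the dictionary layout are assumptions, and the standard deviation shown is the population form.

```python
import math
from collections import defaultdict


def bucket_stats(samples, bucket_secs=10):
    """Aggregate (timestamp, value) samples into fixed-width time buckets.

    Returns {bucket_start: {"count", "avg", "sd", "min", "max"}}.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        start = int(ts // bucket_secs) * bucket_secs
        buckets[start].append(value)

    stats = {}
    for start, values in sorted(buckets.items()):
        count = len(values)
        avg = sum(values) / count
        variance = sum((v - avg) ** 2 for v in values) / count
        stats[start] = {
            "count": count,
            "avg": avg,
            "sd": math.sqrt(variance),
            "min": min(values),
            "max": max(values),
        }
    return stats
```

Keeping min, max, and SD next to the average is what lets a graph show spikes that an average alone would smooth away.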
  • Istatd: 10 Second Resolution of Data
  • Istatd: Stacking graphs on the fly
  • Have a “Strategy” for Monitoring and Alerting
  • Our (Nagios) Strategy
    ► Human element of Monitoring and Alerting (Nagios)
    ► Nagios & Test Driven Development (TDD)
    ► Decouple (Nagios)
    ► Aggregated Checks
  • Human Element of Monitoring and Alerting
    ► Have zero tolerance towards false positives. You do not want your ops staff to walk into the office the next morning looking like zombies ;)
    ► Do not let people develop immunity to pages, as very soon real issues will be ignored
    ► “All pages are actionable” policy: if there is no action, it should not be paging
    ► Automatic re-enabling of alerting/notifications that were improperly silenced
    ► Ownership and accountability of issues/alerts
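Automatically re-enabling silenced notifications could be approached along these lines. `ENABLE_SVC_NOTIFICATIONS` is a real Nagios external command, written to the command file as `[timestamp] COMMAND;host;service`. Everything else here is an assumption: the slides do not say what counts as "improperly silenced", so this sketch simply treats any silence older than a limit as stale, and the function names are made up.

```python
import time


def enable_notifications_command(host, service, now=None):
    """Format a Nagios external command that re-enables service notifications."""
    ts = int(time.time() if now is None else now)
    return f"[{ts}] ENABLE_SVC_NOTIFICATIONS;{host};{service}\n"


def commands_for_stale_silences(silenced, max_silence_secs, now):
    """Build commands for services silenced longer than the allowed limit.

    silenced: iterable of (host, service, silenced_since_epoch) tuples.
    The "older than a limit" rule is an assumed policy, not IMVU's.
    """
    return [
        enable_notifications_command(host, service, now)
        for host, service, since in silenced
        if now - since > max_silence_secs
    ]
```

The resulting lines would be appended to the file configured as `command_file` in nagios.cfg.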
  • Daily Triage of Nagios Alerts and Interrupts
  • Nagios & Test Driven Development (TDD)
    ► Write tests for your Nagios infrastructure
    ► Adopted heavily by Ops (important to keep pace with engineering; DevOps culture is awesome :) )
    ► High degree of confidence in pushing changes
    ► Things will eventually change (OS, libraries, logic, people, Nagios version, etc). Tests will make the change much smoother.
    ► Functional testing can still be a challenge
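One way to write tests for a Nagios setup is to assert properties of the configuration itself. The sketch below checks that every expected host has a `define host` block. It is an illustration only: the slides do not show IMVU's tests, and the helper names, sample config, and host names are invented.

```python
import re
import unittest

# Matches the body of each "define host { ... }" block (not hostgroup etc.).
HOST_BLOCK_RE = re.compile(r"define\s+host\s*\{(.*?)\}", re.DOTALL)
HOST_NAME_RE = re.compile(r"^\s*host_name\s+(\S+)", re.MULTILINE)


def defined_hosts(config_text):
    """Return the set of host_name values found in host definitions."""
    names = set()
    for body in HOST_BLOCK_RE.findall(config_text):
        names.update(HOST_NAME_RE.findall(body))
    return names


def missing_hosts(config_text, expected_hosts):
    """Return expected hosts that have no host definition."""
    return set(expected_hosts) - defined_hosts(config_text)


SAMPLE_CONFIG = """
define host {
    host_name   web1
    address     10.0.0.2
}
define hostgroup {
    hostgroup_name  standard-http-server
}
"""


class HostCoverageTest(unittest.TestCase):
    def test_known_host_is_defined(self):
        self.assertEqual(missing_hosts(SAMPLE_CONFIG, ["web1"]), set())

    def test_gap_is_reported(self):
        self.assertEqual(missing_hosts(SAMPLE_CONFIG, ["web1", "db1"]), {"db1"})
```

Run with `python -m unittest` against the real generated files; a failing test then blocks a bad configuration before it is pushed.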
  • Sample Nagios Test Output
  • Decouple Nagios: we do it using the “Fact, Worker, Reporter & Aggregator” model
    [Diagram. Labels recovered from the slide: Worker, fact, Redis, Reporter, status, Aggregator]
  • Why Decouple?
    ► For scalability and efficiency
    ► Our model performed better than NRPE
    ► Lets you make changes (like thresholds) in one place instead of on something like 1,000 machines (if using NRPE)
    ► Lets you do aggregated checks, a very simple but powerful concept that reduces paging levels by a ton
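The slides name the roles but do not define them, so the following is only one possible reading of the model: workers publish raw facts to a shared store, a reporter turns facts into statuses using thresholds kept in one place, and an aggregator pages only when enough hosts are bad. A plain dict stands in for Redis, and every name and threshold below is an assumption.

```python
OK, WARNING, CRITICAL = 0, 1, 2

# Thresholds live in one place instead of on every monitored machine.
THRESHOLDS = {"disk_used_pct": (80.0, 90.0)}  # (warning, critical), assumed values


def worker_publish(store, host, fact_name, value):
    """Worker: record a raw fact. `store` is a dict standing in for Redis."""
    store[(host, fact_name)] = value


def reporter_statuses(store, fact_name):
    """Reporter: convert each host's fact into a status using central thresholds."""
    warn, crit = THRESHOLDS[fact_name]
    statuses = {}
    for (host, name), value in store.items():
        if name != fact_name:
            continue
        if value >= crit:
            statuses[host] = CRITICAL
        elif value >= warn:
            statuses[host] = WARNING
        else:
            statuses[host] = OK
    return statuses


def aggregate(statuses, max_critical_fraction=0.2):
    """Aggregator: go CRITICAL only if too large a share of hosts is critical."""
    if not statuses:
        return OK
    critical = sum(1 for s in statuses.values() if s == CRITICAL)
    if critical / len(statuses) > max_critical_fraction:
        return CRITICAL
    return OK
```

With this shape, changing a threshold is a single edit to `THRESHOLDS`, and one full disk in a large pool does not page anyone by itself.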
  • Closing Remarks
  • Closing Remarks
    ► Monitoring and Alerting (M&A) is mission critical for any business; invest properly and smartly in it
    ► Don’t limit the usage of Nagios to just Ops. The secret to widespread adoption is to make things frictionless
    ► Bathroom breaks can take 5-10 minutes, so don’t fret too much about Nagios performance
    ► Build some form of predictive monitoring and alerting to catch and alert on changes in trends
    ► Invest in configuration automation, validation and compliance
    ► Finally, Nagios has been like a Honda, very reliable!!!
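As a very simple form of the predictive alerting mentioned above, a check can fit a straight line to recent samples and estimate when a limit will be crossed. This is a generic least-squares sketch, not something described in the talk; the function names and the choice of a linear trend are assumptions.

```python
def linear_fit(points):
    """Least-squares line through (x, y) points; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("need at least two distinct x values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def seconds_until_limit(points, limit):
    """Estimate seconds from the last sample until the trend reaches `limit`.

    points: (epoch_seconds, value) pairs in time order.
    Returns None if the trend is flat or falling.
    """
    slope, intercept = linear_fit(points)
    if slope <= 0:
        return None
    crossing_time = (limit - intercept) / slope
    return max(0.0, crossing_time - points[-1][0])
```

A check built on this could warn when, say, a disk is projected to fill within a day, well before a fixed percentage threshold would fire.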
  • Questions?
  • Thank You!!! kjalleda@imvu.com
    We are hiring: imvu.com/jobs
    Engineering blog: http://engineering.imvu.com/