Automated push monitoring and rollback at imvu

Archcamp Lightning talk on "Metrics, Collection, and Immune Systems"

Kishore Jalleda
Director of Operations
IMVU, Inc

Published in: Technology, Business

Automated push monitoring and rollback at imvu

  1. Automated Push Monitoring and Rollback @ IMVU
     ArchCamp Lightning Talks
     Kishore Jalleda, Director of Operations, IMVU, Inc
  2. How did it all start?
     From a P1 back in 2007:
       • Site issues
       • Ops and eng identify the bad revision
       • Engineers commit a fix
       • Wait for BB to go green
       • Finally all tests pass
       • Push the fix and the site recovers
       • Too bad we were down for 20 minutes
  3. How did it all start? (cont’d)
     Postmortem time (5 whys)
       • Run multiple website revisions for fast rollbacks
     Typical web server directory structure:
       website -> /home/webadmin/website.107847 (symlink)
       website.107767
       website.107788
       website.107825
       website.107834
       website.107835
       website.107847
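The directory layout above makes a rollback a matter of repointing one symlink. A minimal sketch of such a switch, assuming a POSIX filesystem and the `website.<revision>` naming shown on the slide; the function name `switch_revision` and the temporary-link naming are hypothetical, not IMVU's actual tooling:

```python
import os


def switch_revision(base_dir: str, revision: int, link_name: str = "website") -> str:
    """Atomically point base_dir/link_name at base_dir/<link_name>.<revision>."""
    target = os.path.join(base_dir, f"{link_name}.{revision}")
    if not os.path.isdir(target):
        raise FileNotFoundError(f"revision directory not found: {target}")

    link_path = os.path.join(base_dir, link_name)
    tmp_link = f"{link_path}.tmp.{os.getpid()}"  # hypothetical temp name

    # Build the new symlink next to the live one, then rename it over the
    # live link. rename() replaces the old symlink in a single step, so
    # readers never see a missing "website" entry.
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(target, tmp_link)
    os.replace(tmp_link, link_path)
    return target
```

Calling `switch_revision("/home/webadmin", 107835)` would repoint `website` at `website.107835`, provided that directory is still on disk.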
  4. More evolution
       • We had more P1s, more postmortems, and more follow-ups.
       • Identified some common root causes:
           • Finding changes in key metrics was manual and sometimes took days or even weeks.
           • Rolling back was fully scripted but required a manual trigger.
       • Push monitoring and auto rollbacks were born.
  5. Push Monitoring & Auto Rollback
     Phase 1:
       • Push to a small % of servers
       • Monitor pre- and post-push key metrics
       • Key metrics OK? Go to Phase 2
       • Key metrics not OK? Roll back to the previous green revision (simple symlink switch, takes seconds)
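The talk does not say how "key metrics OK?" is decided. One plausible reading is a comparison of each metric before and after the push against a tolerance. The sketch below assumes every metric is "higher is worse" (error counts, latency) and uses a 20% relative tolerance; the function name, metric names, and tolerance are all illustrative assumptions:

```python
from typing import List, Mapping


def degraded_metrics(
    pre: Mapping[str, float],
    post: Mapping[str, float],
    max_relative_increase: float = 0.2,  # assumed tolerance, not from the talk
) -> List[str]:
    """Return the names of metrics that rose beyond the tolerance after a push.

    Assumes pre and post contain the same metric names and that a higher
    value is always worse.
    """
    bad = []
    for name, before in pre.items():
        allowed = before * (1 + max_relative_increase)
        if post[name] > allowed:
            bad.append(name)
    return bad
```

An empty result would mean the push may proceed; any entry would trigger the rollback path. A metric whose pre-push value is zero is flagged on any increase, which may be too strict for noisy metrics.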
  6. Push Monitoring & Auto Rollback (cont’d)
     Phase 2:
       • Push to the rest of the servers
       • Monitor pre- and post-push key metrics
       • Key metrics OK? Push successful
       • Key metrics not OK? Roll back to the previous green revision (simple symlink switch, takes seconds)
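The two phases can be sketched as one control flow. This is an outline under stated assumptions, not IMVU's implementation: `deploy` and `metrics_ok` are hypothetical callables supplied by the caller, and the 10% canary fraction is a guess at what "small %" might mean.

```python
from typing import Callable, Sequence


def phased_push(
    servers: Sequence[str],
    new_rev: int,
    last_green_rev: int,
    deploy: Callable[[Sequence[str], int], None],   # hypothetical: put a revision live on hosts
    metrics_ok: Callable[[Sequence[str]], bool],    # hypothetical: post-push metric check
    canary_fraction: float = 0.1,                   # assumed value for "small %"
) -> bool:
    """Push new_rev in two phases; return True if it stays live."""
    canary_count = max(1, int(len(servers) * canary_fraction))
    canary, rest = servers[:canary_count], servers[canary_count:]

    # Phase 1: small slice of the cluster.
    deploy(canary, new_rev)
    if not metrics_ok(canary):
        deploy(canary, last_green_rev)  # roll back only what was touched
        return False

    # Phase 2: everything else, then check the whole cluster.
    deploy(rest, new_rev)
    if not metrics_ok(servers):
        deploy(servers, last_green_rev)
        return False
    return True
```

Passing the deploy and check steps in as functions keeps the flow testable without real servers.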
  7. What if your push gets rolled back?
     You get an email with the subject “rollback of r107767”.
     The body contains something like this:
       Revision 107767 triggered an alarm in the cluster and was automatically rolled back to revision 107764
       Details: https://foo.imvu.com/push_yyyy.php?push_phase_id=384000
       kjalleda initiated the push at Fri May 13 14:46:38 2011.
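A notification like the one above is easy to assemble with Python's standard `email` package. The sketch below only builds the message; the function name, its parameters, and the recipient address are hypothetical, and the wording simply mirrors the slide.

```python
from datetime import datetime
from email.message import EmailMessage


def build_rollback_email(
    bad_rev: int,
    restored_rev: int,
    details_url: str,
    pusher: str,
    pushed_at: datetime,
    recipient: str,  # hypothetical: whoever should hear about the rollback
) -> EmailMessage:
    """Build (but do not send) a rollback notification."""
    msg = EmailMessage()
    msg["Subject"] = f"rollback of r{bad_rev}"
    msg["To"] = recipient
    msg.set_content(
        f"Revision {bad_rev} triggered an alarm in the cluster and was "
        f"automatically rolled back to revision {restored_rev}\n"
        f"Details: {details_url}\n"
        # datetime.ctime() yields the "Fri May 13 14:46:38 2011" style
        f"{pusher} initiated the push at {pushed_at.ctime()}.\n"
    )
    return msg
```

The returned message could then be handed to `smtplib.SMTP.send_message` or any other mail transport.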
  8. Push Status Page
  9. More evolution
     The hacks below evolved from more postmortems / 5 whys:
       • Regret your last push? “imvu_oops” to the rescue. Along with rolling back to a previous good revision, it also locks commits and pushes and sends an email to ops, eng, and on-call.
       • Ability to manually roll back quickly without having to go through commit/BB/push
       • Ability to manually push a particular revision
       • Ability to manually lock commits and/or pushes
       • Push system itself is broken, now what? (It's a P1 at IMVU.)
       • Automated rollbacks on any metric inaccessibility
       • Immune system for IMVU config variables (site switches)
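"Automated rollbacks on any metric inaccessibility" amounts to failing safe: if a metric cannot be read, treat the push as bad. A minimal sketch of that rule, assuming "higher is worse" metrics with fixed upper thresholds; `fetch_metric`, the metric names, and the thresholds are hypothetical stand-ins for whatever the monitoring system provides:

```python
from typing import Callable, Iterable, Mapping, Optional


def should_roll_back(
    metric_names: Iterable[str],
    fetch_metric: Callable[[str], Optional[float]],  # hypothetical metrics client
    thresholds: Mapping[str, float],                 # assumed upper bounds per metric
) -> bool:
    """Return True if any metric is over its threshold or cannot be read."""
    for name in metric_names:
        try:
            value = fetch_metric(name)
        except Exception:
            # Metrics backend unreachable or erroring: fail safe.
            return True
        if value is None:
            # No data point available counts as inaccessible too.
            return True
        if value > thresholds[name]:
            return True
    return False
```

The trade-off is that an outage in the metrics pipeline itself will roll back healthy pushes, which fits the talk's later warning about rollbacks from unrelated errors.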
  10. Expect some hurdles
       • Don’t expect your push monitoring to catch everything. Not all changes cause immediate impact; some take days or even weeks to surface.
       • There will inevitably be false positives and intermittent issues, for a variety of reasons.
       • Push settings/thresholds may need periodic tweaking to accommodate cluster changes.
       • Ongoing production issues can skew some metrics, which can impact pushes.
       • Rollbacks from unrelated errors are a pain to deal with.
  11. Thank You!
      Kishore Jalleda
      kjalleda@imvu.com
      IMVU recognized as:
        Inc. 500: http://bit.ly/dv52wK
        Red Herring 100: http://bit.ly/bbz5Ex
        Best Place to Work: http://bit.ly/aAVdp8
        (and we're hiring): http://www.imvu.com/jobs