How Zynga handles monitoring at
scale in its hybrid zCloud

Nov 12th, 2013
Matt West : mwest@zynga.com

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1giSlI6.

Matt West explains how to use technologies like CloudStack, Beanstalk, Gearman, mod_gearman, Nagios, nagconf and other tools to monitor large web applications deployed at scale in the zCloud. Filmed at qconsf.com.

Matt West is a systems engineer for Zynga Gaming Inc. and has worked on both sides of the computer lines for 15 years, building and scaling web production infrastructure for a variety of applications. Matt has authored several integral pieces of software to solve the automation puzzle and is a regular contributor to the open source community.

Published in: Technology, Design

  1. How Zynga handles monitoring at scale in its hybrid zCloud
     Nov 12th, 2013
     Matt West : mwest@zynga.com
  2. Watch the video with slide synchronization on InfoQ.com!
     http://www.infoq.com/presentations/zynga-zcloud
     InfoQ.com: News & Community Site
     • 750,000 unique visitors/month
     • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese)
     • Post content from our QCon conferences
     • News 15-20 / week
     • Articles 3-4 / week
     • Presentations (videos) 12-15 / week
     • Interviews 2-3 / week
     • Books 1 / month
  3. Presented at QCon San Francisco
     www.qconsf.com
     • Purpose of QCon
       – to empower software development by facilitating the spread of knowledge and innovation
     • Strategy
       – practitioner-driven conference designed for YOU: influencers of change and innovation in your teams
       – speakers and topics driving the evolution and innovation
       – connecting and catalyzing the influencers and innovators
     • Highlights
       – attended by more than 12,000 delegates since 2007
       – held in 9 cities worldwide
  4. With Great Scale Come Great Challenges…
     • The Sky is Falling!
       – Immense alert volumes
       – Irregular checks with inconsistent results
       – Questionable environment coverage
     • Touching the Oven!
       – Investigate a set of standards to provide to customers
       – Leverage an API for host information
       – Use asynchronous queues for job execution at scale
     • Into the Trenches!
       – Implement findings with customers
     • Profit!
  5. Nagios, Gearman, and Mod-Gearman
     • Nagios is susceptible to processing delays for various reasons.
     • Tuning configuration parameters can help.
     • Gearman / Mod-Gearman give us access to a designated pool of workers.
       – Can grow and shrink with demand from Nagios.
       – Written in C and open source.
       – Practical and easy to deploy.
     [Diagrams: Non Mod-Gearman Enabled Nagios; Mod-Gearman Enabled Nagios]
  6. Nagios, Gearman, and Mod-Gearman Continued…
     • The Nagios daemon loads a NEB (Nagios Event Broker) module.
     • The NEB module either inserts the job into the requested Gearman queue or skips inserting it.
     • Workers running on various servers execute commands on behalf of the Nagios instance.
     • Workers then insert the results of executing the command into the results Gearman queue for processing.
     • Result worker(s) consume the Gearman queue information and pass it back into the Nagios process for any further processing and handling.
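
The job flow described on this slide can be sketched with an in-process simulation. This is not Gearman or Mod-Gearman code: Python's standard `queue` and `threading` modules stand in for the Gearman job and result queues, and the names `worker` and `run_checks` are hypothetical, chosen only for illustration.

```python
import queue
import subprocess
import sys
import threading


def worker(jobs: queue.Queue, results: queue.Queue) -> None:
    """Take check jobs off the job queue, run them, and publish the outcome."""
    while True:
        job = jobs.get()
        if job is None:  # sentinel value: shut this worker down
            break
        host, command = job
        proc = subprocess.run(command, capture_output=True, text=True)
        results.put({
            "host": host,
            "return_code": proc.returncode,
            "output": proc.stdout.strip(),
        })


def run_checks(checks, num_workers=4):
    """Distribute checks to a worker pool and collect every result."""
    jobs, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(jobs, results))
               for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for check in checks:      # the scheduler side: enqueue work, do not run it
        jobs.put(check)
    for _ in threads:         # one sentinel per worker
        jobs.put(None)
    for thread in threads:
        thread.join()
    collected = []            # the "result worker" side: drain the result queue
    while not results.empty():
        collected.append(results.get())
    return collected


if __name__ == "__main__":
    # Stand-in check command; a real deployment would run Nagios plugins.
    ok_check = [sys.executable, "-c", "print('OK')"]
    for result in run_checks([("web01", ok_check), ("web02", ok_check)]):
        print(result)
```

The point of the design is that the scheduling process only enqueues jobs and reads results; the cost of executing checks moves to a pool whose size can change independently.
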
  7. Initial Results
     • Host Check Latency Times in seconds (min; max; avg)
       – Standard: 107.62; 111.23; 109.656
       – Mod-Gearman: 0; 6.59; 2.753
       – Savings: 100.00%; 94.08%; 97.49%
     • Host Check Execution Times in seconds (min; max; avg)
       – Standard: 3.02; 4.03; 4.014
       – Mod-Gearman: 0.5; 0.53; 0.508
       – Savings: 83.44%; 86.85%; 87.34%
  8. Initial Results Continued…
     • Service Check Latency Times in seconds (min; max; avg)
       – Standard: 47.21; 118.17; 110.78
       – Mod-Gearman: 0; 8.05; 0.405
       – Savings: 100.00%; 93.19%; 99.63%
     • Service Check Execution Times in seconds (min; max; avg)
       – Standard: 0.01; 4.21; 0.18
       – Mod-Gearman: 0; 3.24; 0.121
       – Savings: 100.00%; 23.04%; 32.78%
  9. Initial Results Continued…
     • Host and Service Exec Times stay fairly stable.
     • Host and Service Latency Times are immediately reduced.
     • Achieved even while adding more hosts and services to this cluster.
       – Standard: 10452 Services; 1294 Hosts
       – Mod-Gearman: 17996 Services; 1374 Hosts
       – Difference: 7544 Services (+72.18%); 80 Hosts (+6.18%)
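
The savings and growth percentages on the results slides follow from the raw figures. A short sketch of the arithmetic, assuming savings are computed as (before − after) / before and growth as (after − before) / before; the helper names are hypothetical:

```python
def percent_saving(before: float, after: float) -> float:
    """Reduction from `before` to `after`, as a percentage of `before`."""
    return round((before - after) / before * 100, 2)


def percent_growth(before: float, after: float) -> float:
    """Increase from `before` to `after`, as a percentage of `before`."""
    return round((after - before) / before * 100, 2)


# Host check latency in seconds (min, max, avg), taken from the slides.
standard = (107.62, 111.23, 109.656)
mod_gearman = (0.0, 6.59, 2.753)

latency_savings = [percent_saving(b, a) for b, a in zip(standard, mod_gearman)]
print(latency_savings)                 # [100.0, 94.08, 97.49]

# Cluster size before and after the switch, taken from the slides.
print(percent_growth(10452, 17996))    # 72.18 (services)
print(percent_growth(1294, 1374))      # 6.18 (hosts)
```
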
  10. Our Monitoring Scaling Pipe Dream
      • Saigon (Centralized Nagios Configuration Management)
        – What if some of those initial problems weren’t problems, because everyone came to us for Nagios solutions?
      • Distributable result information for various customers
        – What if all the customers could register an API callback for information about Nagios alerts they don’t control?
      • Increased usability of host information from external APIs
        – What if we didn’t have to wait around for host information to come back from off-site APIs due to latency issues?
  11. Saigon and Beanstalkd
      • Saigon UI Explanation
        – Beanstalkd integration
        – RPM Builder
        – Configuration Viewing
        – Configuration Tester
        – Configuration Version Diffing
      • Saigon API Explanation
        – Caching Layer
        – RESTful Syntax
        – Scripted Consumers
  12. Distributed Results with Beanstalkd
      • Nagios
        – Script sends data to an API to be placed into Beanstalkd.
      • Beanstalkd
        – Reduces work done by Nagios server to bare minimum.
        – Possible customers for Nagios results:
          o Stats and Analytics
          o Lifetime Server State Change Logs
          o External Break/Fix Systems
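
One way to picture this hand-off is a small producer that serializes a check result once and places a copy in one queue per consumer. The sketch below is an assumption-laden illustration, not Zynga's script: `InMemoryTubes` is a hypothetical stand-in for a Beanstalkd server (no network, priorities, or delays), and the tube names and payload fields are invented for the example.

```python
import json
from collections import defaultdict, deque


class InMemoryTubes:
    """Hypothetical stand-in for a Beanstalkd server: named FIFO tubes."""

    def __init__(self):
        self._tubes = defaultdict(deque)

    def put(self, tube: str, body: str) -> None:
        self._tubes[tube].append(body)

    def reserve(self, tube: str):
        """Return the oldest job body in the tube, or None if it is empty."""
        jobs = self._tubes[tube]
        return jobs.popleft() if jobs else None


# Assumed tube names, one per consumer type listed on the slide.
CONSUMER_TUBES = ("stats", "state-change-log", "break-fix")


def publish_result(tubes, host, service, state, output) -> str:
    """Serialize one check result and copy it into every consumer tube."""
    body = json.dumps(
        {"host": host, "service": service, "state": state, "output": output},
        sort_keys=True,
    )
    for tube in CONSUMER_TUBES:
        tubes.put(tube, body)
    return body


if __name__ == "__main__":
    tubes = InMemoryTubes()
    publish_result(tubes, "web01", "HTTP", "CRITICAL", "connection refused")
    print(json.loads(tubes.reserve("stats")))
```

Because each consumer reads from its own tube, a slow analytics job does not hold up a break/fix system, and the Nagios side finishes as soon as the result is queued.
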
  13. Rightscale Cache and Beanstalkd
      • External calls to the Rightscale API could be untimely.
        – Fairly reliable at returning small sets of data.
        – Certain large requests had a 10% chance of failing.
      • Implemented Host Hot Cache
        – Leveraged Beanstalkd to manage sub jobs of global jobs.
        – Beanstalkd is used to keep global recurring jobs running.
        – Hot Cache is completely refreshed once every 4 hours.
      • Fronted by RESTful API
        – Allows for single, multi, global host invalidation or revalidation.
        – Created jobs for surfacing known problems between Cloudstack, Rightscale and our Physical Hosts.
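
The hot-cache behaviour described here (full refresh every 4 hours, retries around an unreliable upstream call, and single, multi or global invalidation) can be sketched as follows. The class name `HostHotCache`, the retry count, and the injected `fetch_hosts` callable are assumptions made for illustration; the talk does not describe the actual implementation, and no Rightscale API call is shown.

```python
import time

REFRESH_INTERVAL = 4 * 60 * 60  # seconds; the slide states a 4-hour refresh


class HostHotCache:
    """In-memory host cache refreshed in full from an unreliable source."""

    def __init__(self, fetch_hosts, clock=time.monotonic, retries=3):
        self._fetch_hosts = fetch_hosts  # callable returning {name: info}
        self._clock = clock
        self._retries = retries          # assumed value, not from the talk
        self._hosts = {}
        self._loaded_at = None

    def refresh(self) -> None:
        """Reload every host, retrying when the upstream call fails."""
        last_error = None
        for _ in range(self._retries):
            try:
                hosts = self._fetch_hosts()
            except Exception as exc:     # e.g. a failed large request
                last_error = exc
                continue
            self._hosts = dict(hosts)
            self._loaded_at = self._clock()
            return
        raise RuntimeError("host refresh failed after retries") from last_error

    def get(self, name):
        """Return cached info for a host, refreshing first if stale."""
        stale = (self._loaded_at is None
                 or self._clock() - self._loaded_at >= REFRESH_INTERVAL)
        if stale:
            self.refresh()
        return self._hosts.get(name)

    def invalidate(self, *names) -> None:
        """Drop the named hosts; with no names, drop everything."""
        if not names:
            self._hosts.clear()
            self._loaded_at = None       # forces a full refresh on next read
            return
        for name in names:
            self._hosts.pop(name, None)
```

Readers never wait on the external API unless the cache is stale or empty, which is the latency concern the slide raises. In this sketch an individually invalidated host simply stays absent until the next full refresh; a per-host revalidation call would be a natural extension.
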
  14. Software Used
      • Nagios : http://nagios.org
        o v3.2.3 and v3.5.0
      • Gearman : http://gearman.org
        o v0.25
      • Mod-Gearman : https://labs.consol.de/lang/en/nagios/mod-gearman/
        o v1.4.2
      • Beanstalkd : http://kr.github.io/beanstalkd/
        o v1.4.6
      • Check-MK : http://mathias-kettner.com/check_mk.html
        o v1.2.2p1
  15. Thank you…
      We can now begin the Interrogation Gauntlet… ;)
