Skynet project: Monitor, analyze, scale, and maintain a system in the Cloud
The goal of Skynet is to stop humans from doing repetitive things and to have a system do them better. System automation should be the way to go for any system management, so that humans can focus on the work that really matters.

See the related blog post for more information: http://engineering.slideshare.net/2014/04/skynet-project-monitor-scale-and-auto-heal-a-system-in-the-cloud/


Upload Details

Uploaded as Adobe PDF

Usage Rights

CC Attribution-NonCommercial License


Comments

  • good job
  • good work
  • @kevin_ka Thank you Kevin, very nice of you to share your experience on that!
  • Really, it's impressive
  • Hi - sounds like you've made some good progress (probably even more as this was posted a year ago); a couple of suggestions based on some experiences working on Bing...

    Repair actions:
    - if service re-start doesn't work... we would just reboot the box (or VM as appropriate)
    - If re-boot doesn't work, then we would re-image the server
    - If the above doesn't work N times, we assume there is some HW issue and tell the vendor to replace the machine.
    - a by-product of auto-repair is that it encouraged our devs to build services that are very resilient to individual node failures... because our system would kill their processes
    - ... in fact, for a very long time, none of our services had any shutdown logic. We eventually (after much debate) added a feature (for the front-end services) that would send a notification that they would be killed in 5 seconds so they could drain their queues & stop taking new requests

    Other
    - we were always worried about cascading failures where the actions of our system (we called it AutoPilot) could make the problem worse... so we had a throttle on the max (%) of machines to take action on
    - as you describe, it's important to differentiate a server / node failure from a system/service-level issue and take the appropriate action
    - we tried to keep the controller logic as simple as possible; I'd be very hesitant to add deeper logic / learning
    - we found it very useful to look for outliers... in some cases we would investigate, as it could indicate a code / design defect; most cases were treated as an error (which triggered auto-repair actions)
    - we also looked at machine / process up-time as an outlier and triggered auto-repair
    - I'm a huge fan of collecting / aggregating lots of metrics... however, I'm not a big fan of log collection / aggregation - it's a lot of effort / cost (network, storage, etc). For investigating issues, we found a distributed grep-like tool met the need
    - auto-scaling of capacity, as part of our platform/controller, turned out to be quite complex (for us), as we needed to integrate with the load-balancer to manage the VIP pool and define a general mechanism to measure whether a service is over- (or under-) loaded as a mix of RPS + latency...

    Good luck,

    Kevin Kaufmann
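A minimal Ruby sketch of the escalation ladder and repair throttle described in the comment above; every node operation and the `fleet` interface here are hypothetical placeholders, not AutoPilot's actual API:

```ruby
MAX_REIMAGES        = 3     # after N failed re-images, assume a hardware issue
MAX_REPAIR_FRACTION = 0.10  # never act on more than 10% of the fleet at once

def repair(node, fleet)
  # Throttle: cap concurrent repairs so the repair system itself
  # cannot turn a service-level issue into a cascading failure.
  return :deferred if fleet.under_repair >= (fleet.size * MAX_REPAIR_FRACTION)

  # Escalation ladder: restart -> reboot -> re-image -> replace hardware.
  return :repaired if node.restart_service && node.healthy?
  return :repaired if node.reboot && node.healthy?

  if node.reimage_count < MAX_REIMAGES
    node.reimage
    return node.healthy? ? :repaired : :retrying
  end

  # Re-imaged N times without success: probably a hardware problem.
  node.flag_for_vendor_replacement
  :replace_hardware
end
```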

Presentation Transcript

  • Skynet Project Monitor, analyze, scale, and maintain an infrastructure in the Cloud Or everything humans should not do @SylvainKalache
  • What about me? •Operations Engineer @SlideShare since 2011 •Fan of system automation •@SylvainKalache on Twitter
  • The Cloud The cloud is now everywhere; it's elastic, dynamic, and far from the rigid infrastructures we had before. Most modern architectures use "The Cloud" because it offers so many possibilities that no one can ignore it.
  • What about our tools? But what about our tools? Are they designed to deal with the cloud? Nagios and Ganglia, for example, obviously are not. What are the alternatives? Not many, eh? Of course we could have used CloudWatch, since we are on EC2, but the possibilities are still limited in terms of granularity and customization, and you hit a limit at some point.
  • When these tools do their job and something breaks, Nagios complains and Ganglia shows the problem in its graphs... And then what? Nothing; we need human intervention. Why?
  • Automation? Why do humans still do repetitive things and react to events that a system could handle? Why aren't we automating the management of systems?
  • What brought Skynet to life? SlideShare has an infrastructure hosted on EC2 for document conversion. We have a lot of "factories" that convert different types of documents. This infrastructure scales up and down on demand.
  • Old scaling script Our whole scaling process was divided into 3 parts: 1. A Bash script to launch and configure instances. 2. Monit and Ruby code to keep the conversion code running. 3. Another Ruby script to shut the server down. These 3 pieces of code did not communicate with each other and were not failure-safe.
  • •Idle •Immortal •Wasting money Zombie instances We ended up with instances that were not running any code, would not shut themselves down, and were doing nothing but wasting money.
  • •No metrics - blind •Wasting time fixing Mummy Ops On the Ops side we had no visibility into what was going on with our infrastructure (we were only monitoring the SQS queue). We ended up wasting a lot of time investigating and then eventually fixing problems. There was also a lack of feedback for the developers working on the conversion code: are we converting faster? Better?
  • With Skynet Controller The idea was to create a Controller that would manage the whole instance lifetime, and more. It would scale intelligently based on the current state of the system, but also on trends that we can generate from historical data (see the sketch below).
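As a rough illustration (not code from the talk), a scaling decision that mixes the current state with a trend from historical data might look like this; all names and numbers are made up:

```ruby
# Decide how many conversion workers we want, from the current queue
# depth plus where the recent trend says the queue is heading.
def desired_workers(queue_depth, recent_depths, jobs_per_worker)
  deltas = recent_depths.each_cons(2).map { |a, b| b - a }
  trend  = deltas.sum.to_f / deltas.size   # average growth per sample

  # Provision for the backlog we project a few samples from now,
  # instead of only chasing the present.
  projected = queue_depth + (trend * 5)
  (projected / jobs_per_worker).ceil.clamp(1, 200)
end

desired_workers(1_200, [800, 900, 1_050, 1_200], 50)  # => 38
```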
  • Skynet architecture The idea of Skynet was to make something flexible that we could use elsewhere. It should not be architected specifically for our problem, but for any system. Why not open source it at some point?
  • Collectors: Ruby daemons for collecting system metrics and a gem for application logs. Log collection: Fluentd. Datastore: MongoDB. Query API: Ruby + Sinatra. Controller: Ruby + MCollective + Puppet + Fog. Dashboard: investigation and monitoring tools.
  • Collect system metrics Collect application logs Collectors We collect system metrics via a Ruby daemon running on each machine, and we can collect any metric via plugins. We collect application logs via a gem. These 2 components send data to Fluentd locally (a sketch follows below).
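A minimal sketch of one collector iteration, assuming the fluent-logger gem and a Fluentd forward input listening on localhost:24224; the tag and the metric read here are illustrative, not Skynet's actual ones:

```ruby
require 'socket'
require 'fluent-logger'

# Point the logger at the local Fluentd forward input (default port 24224).
Fluent::Logger::FluentLogger.open(nil, host: 'localhost', port: 24224)

loop do
  # Placeholder metric; the real daemon gathers metrics via plugins.
  load_avg = File.read('/proc/loadavg').split.first.to_f

  # One event per sample; from here on, Fluentd routes it by tag.
  Fluent::Logger.post('skynet.metrics.system',
                      host:   Socket.gethostname,
                      metric: 'load_avg_1m',
                      value:  load_avg)
  sleep 10
end
```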
  • Fluentd •Light •Written in Ruby •Handles failure •Plugins •Local forwarder •Aggregation & routing •Stream processing Why? What? Fluentd is in charge of collecting logs, routing them, and carrying them to their endpoint. It handles failure properly (failover to another node, plus it backs logs up if the next hop is down). Written in Ruby, super light. Another advantage is the architecture: any input/output is managed via plugins that you create/customize without having to mess with the Fluentd core. An example configuration follows below.
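For example, a classic out_forward configuration along these lines gives exactly that failover-plus-backup behavior (hostnames and paths here are placeholders, not Skynet's actual setup):

```
<source>
  type forward
  port 24224
</source>

# Route skynet.** to an aggregator; fail over to a standby node, and
# back logs up to local files if both next hops are down.
<match skynet.**>
  type forward
  <server>
    host aggregator-1.example.com
    port 24224
  </server>
  <server>
    host aggregator-2.example.com
    port 24224
    standby
  </server>
  <secondary>
    type file
    path /var/log/fluentd/backup
  </secondary>
</match>
```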
  • •Schema-less •Key/value matches log format •Store system metrics & application logs •Job metadata Why? What? MongoDB is schema-less, which gives us the possibility to make our data schema evolve very easily. MongoDB roughly fits the bill for our format: 1 log entry = one Mongo document (see the sketch below). We store system metrics and information (CPU, memory, top output, disk, Nginx active connections...) but also application logs.
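A sketch of what one such document might look like, using the current mongo gem API; the field names are illustrative, not Skynet's actual schema:

```ruby
require 'mongo'

client = Mongo::Client.new(['127.0.0.1:27017'], database: 'skynet')

# One log entry = one document; schema-less, so a newer collector can
# add fields to new documents without migrating anything.
client[:system_metrics].insert_one(
  host:      'conversion-factory-042',
  timestamp: Time.now.utc,
  cpu:       { user: 61.2, system: 8.4, idle: 30.4 },
  memory:    { used_mb: 3_214, free_mb: 883 },
  nginx:     { active_connections: 112 }
)
```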
  • Abstraction of the datastore Easy REST interface Keep control over requests processed Post-request computing via plugin API The abstraction layer/REST API gives us the possibility to use any datastore in the backend; we are now using MongoDB, but we could use Elasticsearch in front of it later. Also, depending on the type of data, we store it in different MongoDB clusters, but this is totally transparent to the data consumer. We keep control over what is possible to do, so that a single data consumer does not crash the whole datastore. We offer the possibility of computing plugins that process the data right on the API server before returning it to the requester; it is better to process data as near to the source as possible, to avoid transporting big chunks of data over the network. A sketch follows below.
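A minimal Sinatra sketch of the idea; the route, the limit guardrail, and the DATASTORE object are hypothetical stand-ins for the real Query API:

```ruby
require 'sinatra'
require 'json'

# Latest system metrics for one host. The datastore behind this call is
# abstracted away, so swapping MongoDB for Elasticsearch later would be
# invisible to data consumers.
get '/v1/metrics/:host' do
  # Guardrail: cap the result size so a single consumer cannot
  # crash the whole datastore with an unbounded query.
  limit = [params.fetch('limit', '100').to_i, 1_000].min

  content_type :json
  DATASTORE.latest_metrics(params['host'], limit: limit).to_json
end
```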
  • Monitor all the things Granular & global view Investigation tool Dashboard We can monitor all kinds of things; the possibilities are endless with the combination of the Ruby daemon and the gem. We can use all this data to get a global overview of our infrastructure, but also drill down and see that at this particular second there was a CPU spike for a job on this machine converting this document. It's also a useful debugging tool.
  • Automate Scale Fix Alert Controller The Controller can use all the data collected to scale, based on the current state but also on trends. It can take action if a system is in an abnormal state and try to fix it via possibility trees (a sketch follows below). And finally, the Skynet Controller can also alert, in case the system reaches its limit and finally needs a human, again...!
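The talk doesn't show the possibility trees themselves, but as a hedged sketch, one might be walked like this; all checks and actions are placeholders:

```ruby
# Each branch pairs a symptom check with a repair action; the Controller
# walks the tree until the node is healthy or the tree is exhausted,
# at which point a human is finally alerted.
POSSIBILITY_TREE = [
  { check: ->(n) { !n.process_running? }, action: ->(n) { n.restart_process } },
  { check: ->(n) { n.disk_full? },        action: ->(n) { n.purge_tmp_files } },
  { check: ->(n) { !n.reachable? },       action: ->(n) { n.reboot } }
]

def heal(node)
  POSSIBILITY_TREE.each do |branch|
    next unless branch[:check].call(node)
    branch[:action].call(node)
    return :fixed if node.healthy?
  end
  alert_humans(node)  # the system reached its limit: page an operator
  :needs_human
end
```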
  • Want to read more about it? Check out the blog post! http://engineering.slideshare.net/2014/04/skynet-project-monitor-scale-and-auto-heal-a-system-in-the-cloud/
  • Thank you! Let Skynet take over the world to make it a better place. @SylvainKalache If you are interested in Skynet, please don't hesitate to reach out to me!