Mini-Training: Netflix Simian Army


A short presentation on the Netflix robots that help ensure the reliability and resilience of its massive distributed system.

Published in: Technology

  1. Netflix
     • Founded in 1997
     • World's leading internet subscription service for enjoying movies and TV programs
     • More than 48 million members in more than 40 countries enjoying more than one billion hours of TV shows and movies per month
     • Watch it anytime, anywhere, on all your connected devices
     • From 2 to 60 billion requests a day to their API in 2 years
     • 12 billion outbound requests to API dependencies
     • A complex distributed system
  2. Amazon Web Services
     • AWS
       • Officially launched in 2006
       • Offers a broad set of global compute, storage, database, analytics, application, and deployment services
       • Accessible over HTTP via REST or SOAP
       • Data centers located in 8 regions around the world
       • Customers include NASA, Netflix and the CIA (private AWS replica)
     • What it provides
       • Elastic Compute Cloud (EC2): resizable compute capacity in the cloud
       • Elastic Block Store (EBS): block-level storage volumes used by EC2 instances
       • Elastic Load Balancing: automatic distribution of incoming application traffic across multiple EC2 instances
  3. Architecture picture
     [Figure: Region A spanning Availability Zones 1, 2 and 3; each zone hosts two Auto Scaling groups (ASG 1 and ASG 2), each running instances labeled A, B and C.]
  4. Transition to AWS
     • Dorothy, you’re not in Kansas anymore
       • Prepare to unlearn a lot of what you know
       • Be much more structured about “over the wire” interactions
     • Co-tenancy is hard
       • Build your system to expect and accommodate failure at any level
     • The best way to avoid failure is to fail constantly
       • Design each distributed system to expect and tolerate failure from other systems on which it depends
       • Constantly test your ability to succeed despite failure
     • Learn with real scale, not toy models
       • Try doing it at full scale with real data
       • Validate your design choices: with real scale comes trouble
     • Commit yourself
       • It is hard to start
       • Learn from your mistakes
  5. The Simian Army
     • Availability and Resiliency as a Service
     • A set of tools (scheduled agents) that deliberately shut down services, degrade performance, check conformity, and so on, and test the system's ability to survive them
       • Chaos Monkey * (Chaos Gorilla and Chaos Kong)
       • Latency Monkey
       • Conformity Monkey *
       • Security Monkey
       • Doctor Monkey
       • Janitor Monkey *
       • Howler Monkey
       • More to come…
  6. The Chaos Monkey
     • How
       • Service running on AWS that seeks out Auto Scaling groups and terminates instances in each group
       • Design flexible enough to be extended to other cloud providers or other instance groupings
       • Has a configurable schedule, running by default on non-holiday weekdays between 9am and 3pm
       • Chaos Gorilla simulates the outage of an entire Availability Zone
       • Chaos Kong simulates the outage of an entire Region
     • Why
       • Prepare to fail, to ensure you can tolerate instance failure
       • Learn from new, unpredicted issues that may occur
       • Check that services rebalance automatically with no visible impact on users
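The default schedule and per-group termination described on the Chaos Monkey slide can be sketched in a few lines of Python. This is only an illustration, not the actual Chaos Monkey code: the function names, the dictionary of groups, and the choice of exactly one victim per group per run are all assumptions, and no AWS call is made.

```python
import random
from datetime import datetime


def in_chaos_window(now, holidays=frozenset()):
    """True on non-holiday weekdays between 9am and 3pm (the slide's default)."""
    is_weekday = now.weekday() < 5          # Monday=0 .. Friday=4
    is_holiday = now.date() in holidays
    return is_weekday and not is_holiday and 9 <= now.hour < 15


def pick_victims(groups, rng):
    """Pick one instance per non-empty group (one per run is an assumption)."""
    return {name: rng.choice(ids) for name, ids in groups.items() if ids}


# Hypothetical inventory: group name -> instance ids.
groups = {
    "ASG 1": ["i-a1", "i-b1", "i-c1"],
    "ASG 2": ["i-a2", "i-b2", "i-c2"],
    "empty": [],
}

now = datetime(2014, 3, 12, 10, 30)  # a Wednesday morning
victims = pick_victims(groups, random.Random(0)) if in_chaos_window(now) else {}
for group, instance_id in victims.items():
    # A real agent would call the cloud provider's terminate API here.
    print(f"would terminate {instance_id} in {group}")
```

Keeping the schedule check separate from the selection makes it straightforward to swap the grouping (Auto Scaling group, Availability Zone, Region) without touching the timing rule.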
  7. The Latency Monkey
     • How
       • Service running on AWS that induces artificial delays in the RESTful communication layer and measures how upstream services respond
       • With very large delays, the downtime of a node or even an entire service can be simulated without actually bringing the instances down
     • Why
       • Simulate service degradation
       • Test that services respond appropriately
       • Test the ability to survive the downtime of an entire service
       • Test the fault tolerance of a new service by simulating the failure of its dependencies without affecting the rest of the system
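The idea behind the Latency Monkey slide, delaying a dependency and checking that the caller degrades gracefully, can be illustrated with a small Python sketch. The helper names (`with_latency`, `call_with_fallback`) and the timeout-plus-fallback policy are assumptions made for this example; the slide does not describe how Netflix services actually react.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def with_latency(func, delay_s):
    """Wrap a dependency call so every invocation is artificially delayed."""
    def delayed(*args, **kwargs):
        time.sleep(delay_s)
        return func(*args, **kwargs)
    return delayed


def call_with_fallback(func, timeout_s, fallback):
    """Call a dependency, returning the fallback if it does not answer in time."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func)
    try:
        return future.result(timeout=timeout_s)
    except TimeoutError:
        return fallback
    finally:
        pool.shutdown(wait=False)  # do not block on the slow call


def fetch_recommendations():
    """Hypothetical dependency that normally answers immediately."""
    return "fresh"


# Inject 0.3 s of delay while the caller only tolerates 0.05 s.
slow_dependency = with_latency(fetch_recommendations, 0.3)
result = call_with_fallback(slow_dependency, timeout_s=0.05, fallback="cached")
print(result)
```

With a delay far larger than any timeout, the same wrapper simulates a dependency that is effectively down, which matches the slide's point that no instance needs to be stopped.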
  8. The Conformity Monkey
     • How
       • Service running on AWS that finds instances that don’t comply with predefined best practices
       • Marks non-compliant instances, shuts them down and notifies the corresponding owners
       • The check is performed every hour by default
       • A notification is sent only once per day, at noon
     • Why
       • Non-compliant instances, such as those not belonging to an Auto Scaling group, are trouble waiting to happen
       • Anticipate and give the owners a chance to relaunch them properly
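As a rough illustration of the Conformity Monkey slide, the sketch below applies a single rule (an instance must belong to an Auto Scaling group) and limits notifications to one per owner per day. The `Instance` record, the function names, and the in-memory `last_notified` map are hypothetical; a real rule set would be larger.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Instance:
    instance_id: str
    owner: str
    asg: Optional[str]  # name of the Auto Scaling group, or None


def find_nonconforming(instances):
    """Rule from the slide: every instance should belong to an Auto Scaling group."""
    return [i for i in instances if i.asg is None]


def owners_to_notify(offenders, last_notified, today):
    """Return owners not yet notified today and record the notification."""
    due = sorted({i.owner for i in offenders if last_notified.get(i.owner) != today})
    for owner in due:
        last_notified[owner] = today
    return due


fleet = [
    Instance("i-1", "team-api", "ASG 1"),
    Instance("i-2", "team-api", None),
    Instance("i-3", "team-search", None),
]
last_notified = {}
today = date(2014, 3, 12)

offenders = find_nonconforming(fleet)
first_pass = owners_to_notify(offenders, last_notified, today)
second_pass = owners_to_notify(offenders, last_notified, today)  # hourly re-check
print(first_pass, second_pass)
```

The hourly check can therefore run as often as needed without spamming owners, since the second pass on the same day finds nobody left to notify.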
  9. The Security Monkey
     • How
       • Service running on AWS, built as an extension of the Conformity Monkey
       • Finds security violations or vulnerabilities
     • Why
       • Track improperly configured security groups and terminate the offending instances
       • Ensure that all SSL and DRM certificates are valid and not coming up for renewal
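The certificate check mentioned on the Security Monkey slide amounts to comparing expiry dates against a warning window. A minimal sketch, assuming a hypothetical inventory of certificate names and expiry dates and an arbitrary 30-day window (the slide gives no figure):

```python
from datetime import date, timedelta


def expiring_certificates(expiry_by_name, today, warn_days=30):
    """Names of certificates already expired or expiring within warn_days."""
    cutoff = today + timedelta(days=warn_days)
    return sorted(name for name, expires in expiry_by_name.items() if expires <= cutoff)


# Hypothetical inventory: certificate name -> expiry date.
certificates = {
    "api-ssl": date(2014, 4, 1),
    "drm-signing": date(2015, 1, 1),
    "legacy-ssl": date(2014, 2, 1),
}

flagged = expiring_certificates(certificates, today=date(2014, 3, 12))
print(flagged)
```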
  10. The Doctor Monkey
     • How
       • Service running on AWS that taps into the health checks running on each instance
       • Monitors other external health signs such as CPU load or memory usage
       • Removes instances detected as unhealthy from service
     • Why
       • Give service owners time to find the root cause of the problem
       • Eventually terminate the detected instances
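The Doctor Monkey behavior, pulling unhealthy instances out of service without terminating them right away, could look roughly like the sketch below. The `HealthSample` record, the 95% thresholds and the `triage` function are assumptions for illustration; the slide does not state any thresholds.

```python
from dataclasses import dataclass


@dataclass
class HealthSample:
    instance_id: str
    health_check_ok: bool
    cpu_load: float      # fraction of capacity, 0.0 to 1.0
    memory_used: float   # fraction of capacity, 0.0 to 1.0


def is_unhealthy(sample, max_cpu=0.95, max_memory=0.95):
    """Combine the instance's own health check with external signs."""
    return (
        not sample.health_check_ok
        or sample.cpu_load > max_cpu
        or sample.memory_used > max_memory
    )


def triage(in_service, samples):
    """Split instances into those still serving and those set aside for diagnosis."""
    quarantined = {
        s.instance_id for s in samples
        if s.instance_id in in_service and is_unhealthy(s)
    }
    return in_service - quarantined, quarantined


samples = [
    HealthSample("i-1", True, 0.40, 0.50),
    HealthSample("i-2", False, 0.20, 0.30),
    HealthSample("i-3", True, 0.99, 0.60),
]
serving, quarantined = triage({"i-1", "i-2", "i-3"}, samples)
print(sorted(serving), sorted(quarantined))
```

Quarantined instances stay alive in this sketch, which leaves owners time to investigate before any eventual termination.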
  11. The Janitor Monkey
     • How (mark, notify, delete)
       • Service running on AWS that searches for unused resources
       • Marks resources as cleanup candidates
       • Schedules a disposal time for each resource
       • The cleanup deadline is defined in the rule that marks the resource
       • Notifies the owners of the marked resources
       • By default, the notification is sent 2 business days before the cleanup deadline
       • During this period the owner can decide to clean up or retain the resource
       • Disposes of resources once the deadline is reached
     • Why
       • Ensure that the cloud environment runs free of clutter and waste
       • Save on operating costs (the more you use, and the longer you use it, the more you pay)
       • Free up engineering time: no need to manage unused resources anymore
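The mark, notify, delete cycle on the Janitor Monkey slide depends on one piece of date arithmetic: the notification goes out 2 business days before the cleanup deadline. A minimal sketch of that logic, assuming weekends are the only non-business days and using hypothetical function names:

```python
from datetime import date, timedelta


def subtract_business_days(day, n):
    """Step back n business days, skipping Saturdays and Sundays."""
    while n > 0:
        day -= timedelta(days=1)
        if day.weekday() < 5:  # Monday=0 .. Friday=4
            n -= 1
    return day


def notification_date(deadline, lead_days=2):
    """Default from the slide: notify 2 business days before the deadline."""
    return subtract_business_days(deadline, lead_days)


def janitor_action(today, deadline, retained):
    """Decide what to do with a marked resource on a given day."""
    if retained:
        return "keep"      # the owner opted to retain the resource
    if today >= deadline:
        return "delete"
    if today >= notification_date(deadline):
        return "notify"
    return "wait"


deadline = date(2014, 3, 17)  # a Monday
print(notification_date(deadline))
print(janitor_action(date(2014, 3, 14), deadline, retained=False))
```

Because the deadline here falls on a Monday, the two business days reach back across the weekend to the preceding Thursday.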
  12. The Howler Monkey
     • How
       • Service running on AWS that monitors whether a workload is reaching AWS limits and reports it
     • Why
       • Maintain healthy operations by ensuring that AWS limits are respected
       • Save on operating costs
  13. Netflix Open Source Software
     • Build your own robust and highly available platform
     • Release PaaS components git by git to incentivize others
       • Sources at
       • Intros and techniques at
       • Blog post or new code every few weeks
     • Motivations
       • Give back to the Apache-licensed OSS community
       • Motivate, retain and hire top engineers
       • "Peer pressure" code cleanup, external contributions
     • Users and contributors
       • IBM
       • Waze
       • Yahoo
       • Eucalyptus (scalable cloud software)
       • Yammer (private social network), …
  17. About Us
     • Betclic Everest Group, one of the world leaders in online gaming, has a unique portfolio comprising various complementary international brands: Betclic, Everest, Bet-at-, Expekt, Monte-Carlo Casino…
     • Through our brands, Betclic Everest Group places expertise, technological know-how and security at the heart of our strategy to deliver an online gaming offer attuned to the passion of our players. We want our brands to be easy to use for every gamer around the world. We’re building our company to make that happen.
     • Active in 100 countries with more than 12 million customers worldwide, the Group is committed to promoting secure and responsible gaming and is a member of several international professional associations including the EGBA (European Gaming and Betting Association) and the ESSA (European Sports Security Association).