RMG202 Rainmakers: How Netflix Operates Clouds for Maximum Freedom and Agility - AWS re:Invent 2012
 


In this session, learn how Netflix has embraced DevOps and leveraged all that Amazon has to offer to allow our developers maximum freedom and agility.

Usage Rights

© All Rights Reserved

    Presentation Transcript

    • Rainmakers: How Netflix Operates Clouds for Maximum Freedom and Agility. Jeremy Edberg, Reliability Architect, Netflix
    • Do you have...
      • A release engineer?
      • A QA department?
      • Chef or Puppet to manage your systems?
      Tweet @jedberg with feedback!
    • Do you have...
      • Upwards of 100 releases a day?
    • With more than 30 million streaming members in the United States, Canada, Latin America, the United Kingdom, Ireland and the Nordics, Netflix is the world's leading internet subscription service for enjoying movies and TV programs streamed over the internet to PCs, Macs and TVs. Source: http://ir.netflix.com
    • The Netflix Way
      • Everything is “built for three”
      • Fully automated build tools to test and make packages
      • Fully automated machine image bakery
      • Fully automated image deployment
      • Independent teams responsible for both Dev and Ops
    • Philosophy
    • Automate all the things!
      • Application startup
      • Configuration
      • Code deployment
      • System deployment
    • Automation
      • Standard base image
      • Tools to manage all the systems
      • Automated code deployment
    • Shared state should be stored in a shared service. Data on an instance should be replicated to other instances.
    • “Build for Three”: We hold a boot camp for new engineers to teach them how to build for a highly distributed environment.
    • Netflix on AWS (diagram; labels: 2012, IPv6, Open Connect)
    • Highly aligned, loosely coupled
      • Services are built by different teams who work together to figure out what each service will provide.
      • The service owner publishes an API that anyone can use.
    • Advantages to a Service Oriented Architecture
      • Easier auto-scaling
      • Easier capacity planning
      • Identify problematic code-paths more easily
      • Narrow in on the effects of a change
      • More efficient local caching
    • Freedom and Responsibility
      • Developers deploy when they want
      • They also manage their own capacity and autoscaling
      • And fix anything that breaks at 4am!
    • All systems choices assume some part will fail at some point.
    • The Monkey Theory
      • Simulate things that go wrong
      • Find things that are different
    • Execution (photo from I, Robot, copyright 20th Century Fox)
    • Netflix built a global PaaS
      • Service Oriented Architecture
      • HTTP/REST interfaces between services
    • Netflix PaaS features
      • Supports all regions and zones
      • Multiple accounts
      • Cross region/account replication
      • Internationalized, localized and GeoIP routed
      • Advanced key management
      • Autoscaling with 1000s of instances
      • Monitoring and alerting on millions of metrics
    • What AWS Provides
      • Instances
      • Machine Images
      • Elastic IPs
      • Load Balancers
      • Security groups / Autoscaling groups
      • Availability zones and regions
    • Linux Base AMI (CentOS or Ubuntu) (diagram of what runs on an instance)
      • Optional Apache
      • Java (JDK 6 or 7)
      • AppDynamics app agent monitoring
      • GC and thread dump logging
      • Tomcat
      • Application war file, base servlet, platform, interface jars for dependent services
      • Healthcheck, status servlets, JMX interface, Servo autoscale
      • Monitoring: log rotation to S3, AppDynamics machine agent
    • The Netflix Platform (diagram; some components are marked as Open Source)
      • Discovery (Eureka)
      • Circuit Breakers (Hystrix)
      • Entrypoints (Edda)
      • Configuration (Archaius)
      • Cassandra (Priam & Astyanax & CassJMeter)
      • Zookeeper (Exhibitor)
      • Cryptex
      • Logging (Blitz4j & Honu)
      • AKMS
      • EvCache
      • NIWS
      • Proxies
      • i18n
      • Geo
      • L10n
      • Base
    • Open Source at Netflix: Governator, Blitz4j, Edda, Hystrix
    • Finding things
      • Discovery (Eureka)
        • Application to instance mapping
        • Heartbeat to keep track of health
      • Entrypoints (Edda)
        • Local database of AWS resources
      • NIWS (Netflix Internal Web Service)
        • On instance software load balancer
        • Handles retry logic
      • Geo (Geolocation library)
        • Provides IP to Lat/Lon mapping for any service that needs it.
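The "on instance software load balancer" with "retry logic" described for NIWS can be pictured with a small sketch. This is not the NIWS API: the class name, the `max_attempts` parameter and the choice of round-robin rotation are all assumptions made for illustration.

```python
from itertools import cycle
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class AllServersFailed(Exception):
    """Raised when every attempt against the server list failed."""


class ClientSideBalancer:
    """Hypothetical client-side balancer: round-robin with retry on failure."""

    def __init__(self, servers: Sequence[str], max_attempts: int = 3) -> None:
        if not servers:
            raise ValueError("at least one server is required")
        self._rotation = cycle(servers)
        self._max_attempts = max_attempts

    def call(self, request: Callable[[str], T]) -> T:
        """Run request(server), moving to the next server on ConnectionError."""
        last_error = None
        for _ in range(self._max_attempts):
            server = next(self._rotation)
            try:
                return request(server)
            except ConnectionError as err:
                last_error = err  # remember the failure, try the next server
        raise AllServersFailed("all attempts failed") from last_error
```

The caller passes a function that performs the actual request, so the balancer itself stays independent of the transport.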
    • Entrypoints (Edda)
      • REST API
        • GET /REST/v2/instance/$id
      • Keeps track of all resources
        • Autoscaling groups, EIPs, Instances, Applications, Clusters, History
    • Entrypoints Exploration
      • Find all active instances: GET /REST/v2/view/instances
      • Find all instances in a cluster: GET /REST/v2/group/clusters
      • Show only ASG name, instance ID and health: /v2/aws/autoScalingGroups/edda-v123;_pp:(autoScalingGroupName,instances:(instanceId,lifecycleState))
      • Which ASG contains a particular instance? /v2/aws/autoScalingGroups;instances.instanceId=i-96f3ca3a
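The last two queries on the slide follow a visible pattern: a collection path, an optional item name, `;key=value` filters, and a `;_pp:(...)` field list. A minimal sketch of a helper that assembles such paths is shown below. The function name and its parameters are hypothetical, and the sketch only builds strings in the shape shown on the slide; it does not contact a server or claim to cover the full query syntax.

```python
from typing import Mapping, Optional, Sequence


def edda_path(
    collection: str,
    item: Optional[str] = None,
    filters: Optional[Mapping[str, str]] = None,
    fields: Optional[Sequence[str]] = None,
) -> str:
    """Build a query path in the shape shown on the slide (illustrative only)."""
    path = "/" + collection.strip("/")
    if item is not None:
        path += "/" + item
    # Filters are appended as ;key=value, as in the "which ASG" example.
    for key, value in (filters or {}).items():
        path += ";{}={}".format(key, value)
    # Field list is appended as ;_pp:(a,b,...), copied from the slide's form.
    if fields:
        path += ";_pp:(" + ",".join(fields) + ")"
    return path


asg_summary = edda_path(
    "v2/aws/autoScalingGroups",
    item="edda-v123",
    fields=["autoScalingGroupName", "instances:(instanceId,lifecycleState)"],
)
asg_for_instance = edda_path(
    "v2/aws/autoScalingGroups",
    filters={"instances.instanceId": "i-96f3ca3a"},
)
```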
    • Keeping it all Straight
      • Configuration (Archaius)
        • Global variables (Fast properties)
      • Base
        • Base system. Prod vs. Test, etc
      • Zookeeper (Curator)
        • Locks, other similar coordination
      • Logging (Blitz4j and Honu)
        • Keep track of what happened and store it for post analysis.
    • Keeping it Secure
      • Cryptex
        • Service for key management
        • High, medium and low value keys
      • AKMS (Amazon Key Management System)
        • Hands out keys to instances (and dev boxes) so they don’t have to store the key on the instance
      For more info, see SEC201: Security Panel
    • Storing it
      • Cassandra (Priam, Astyanax)
        • Configure and access Cassandra
        • Provide OO abstractions, handle connection pooling and discovery of hosts
      • EVCache (Eccentric Volatile Cache)
        • Wrapper for memcached to handle zone awareness and replication
      • Proxies
        • Get data out of the datacenter and into the cloud.
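To make "zone awareness and replication" concrete, here is a toy sketch that uses plain dictionaries in place of memcached. It is not the EVCache API. The class, its method names, and the policy (write to every zone, read from the local zone, fall back to other zones on a miss) are assumptions chosen for illustration.

```python
from typing import Any, Dict, Sequence


class ZoneAwareCache:
    """Toy stand-in for a zone-aware cache wrapper (dicts replace memcached)."""

    def __init__(self, zones: Sequence[str], local_zone: str) -> None:
        if local_zone not in zones:
            raise ValueError("local_zone must be one of zones")
        self._stores: Dict[str, Dict[str, Any]] = {zone: {} for zone in zones}
        self._local = local_zone

    def set(self, key: str, value: Any) -> None:
        # Replicate every write to the cache in each zone.
        for store in self._stores.values():
            store[key] = value

    def get(self, key: str, default: Any = None) -> Any:
        local = self._stores[self._local]
        if key in local:
            return local[key]  # fast path: same-zone read
        for zone, store in self._stores.items():
            if zone != self._local and key in store:
                local[key] = store[key]  # repair the local copy
                return store[key]
        return default

    def drop_zone(self, zone: str) -> None:
        """Simulate losing all cached data in one zone."""
        self._stores[zone].clear()
```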
    • Data: What do we do with it all?
    • We store it!
      • Cache (memcached)
      • Cassandra
      • RDS (MySQL)
    • Cassandra
    • Why Cassandra?
      • Availability over consistency
      • Writes over reads
      • We know Java
      • Open source + support
    • Using Cassandra at Netflix
      • Priam
        • Zero touch auto-config
        • State management
        • Token assignment
        • Node replacement
        • Backup/restore to/from S3
      • Astyanax
        • OO abstraction to Cassandra
        • Multi-region support
    • Cassandra Architecture (diagram). For more info, see DAT202: Optimizing your Cassandra Database on AWS
    • Tools
      • Asgard
      • AWS usage
      • Atlas
      • Chronos
      • Build system
      • Explorers (Cassandra and SimpleDB)
    • Diagram labels: Elastic Load Balancer, Auto Scaling Group, Instances, Security Group, Launch Configuration, Amazon Machine Image
    • Diagram labels: api-frontend, api-usprod-v007, api-usprod-v008
    • Netflix has moved the granularity from the instance to the cluster
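The group names on the preceding slide (api-usprod-v007, api-usprod-v008) suggest a cluster name followed by a three-digit version suffix. A small sketch of that naming scheme follows. The regular expression, the function names and the wrap-around after v999 are assumptions for illustration, not a description of how Asgard implements it.

```python
import re
from typing import Dict, Iterable, List

# Assumed pattern: <cluster>-v<three digits>, e.g. api-usprod-v007.
_VERSIONED = re.compile(r"^(?P<cluster>.+)-v(?P<version>\d{3})$")


def cluster_of(group_name: str) -> str:
    """Return the cluster part of a versioned group name."""
    match = _VERSIONED.match(group_name)
    if match is None:
        raise ValueError("not a versioned group name: " + group_name)
    return match.group("cluster")


def next_group_name(group_name: str) -> str:
    """Return the name the next deployment of the same cluster would get."""
    match = _VERSIONED.match(group_name)
    if match is None:
        raise ValueError("not a versioned group name: " + group_name)
    following = (int(match.group("version")) + 1) % 1000  # assumed wrap-around
    return "{}-v{:03d}".format(match.group("cluster"), following)


def group_by_cluster(group_names: Iterable[str]) -> Dict[str, List[str]]:
    """Collect versioned groups under their cluster name."""
    clusters: Dict[str, List[str]] = {}
    for name in group_names:
        clusters.setdefault(cluster_of(name), []).append(name)
    return clusters
```

Treating the cluster, rather than a single instance or group, as the unit means a new version can be launched next to the old one and traffic moved between them.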
    • Why Bake?
      • Traditional (Generic AMI to Instance): launch OS, install packages, install app
      • Netflix (App AMI to Instance): launch OS+app
    • Getting Baked (build pipeline diagram)
      • Source: Perforce / Git
      • Jenkins: resolve, test, publish, sync, compile, build, report
      • Ant targets, Groovy all over
      • Ivy: snapshot / release libraries
      • Artifactory: libraries / apps, app bundles
    • Base Image Baking (diagram)
      • S3 / EBS: foundation AMI (Linux: CentOS, Fedora, Ubuntu), base AMI
      • Bakery on AWS (ec2 slave instances): mount, snapshot, bake
      • Yum / Apt RPMs: Apache, Java...
      • Result: ready for app install
    • App Image Baking (diagram)
      • S3 / EBS: base AMI (Linux, Apache, Java, Tomcat), app AMI
      • Bakery on AWS (ec2 slave instances): mount, snapshot
      • Jenkins / Artifactory, Yum / install: app bundle
      • Result: ready to launch!
    • Linux Base AMI (CentOS or Ubuntu), Tomcat variant
      • Optional Apache
      • Java (JDK 6 or 7)
      • AppDynamics app agent monitoring
      • GC and thread dump logging
      • Tomcat
      • Application war file, base servlet, platform, interface jars for dependent services
      • Healthcheck, status servlets, JMX interface, Servo autoscale
      • Monitoring: log rotation to S3, AppDynamics machine agent
    • Linux Base AMI (CentOS or Ubuntu), JBoss variant
      • Optional Apache
      • Java (JDK 6 or 7)
      • AppDynamics app agent monitoring
      • GC and thread dump logging
      • JBoss
      • Application war file, base servlet, platform, interface jars for dependent services
      • Healthcheck, status servlets, JMX interface, Servo autoscale
      • Monitoring: log rotation to S3, AppDynamics machine agent
    • Linux Base AMI (CentOS or Ubuntu), Python variant
      • Optional Apache
      • Python
      • Monitoring, logging
      • Django
      • Application file, base server, platform, interface libs for dependent services
      • Monitoring: log rotation to S3, AppDynamics machine agent
    • The Monkey Theory
      • Simulate things that go wrong
      • Find things that are different
    • The simian army
      • Chaos -- Kills random instances
      • Chaos Gorilla -- Kills zones
      • Chaos Kong -- Kills regions
      • Latency -- Degrades network and injects faults
      • Conformity -- Looks for outliers
      • Circus -- Kills and launches instances to maintain zone balance
      • Doctor -- Fixes unhealthy resources
      • Janitor -- Cleans up unused resources
      • Howler -- Yells about bad things like Amazon limit violations
      • Security -- Finds security issues and expiring certificates
      For more info, see ARC301: Intro to Chaos Monkey & the Simian Army
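The "kills random instances" idea can be sketched as a pure selection step, leaving out the actual termination. This is not the Chaos Monkey implementation. The function name, the per-group `probability` parameter and the one-victim-per-group rule are assumptions for illustration.

```python
import random
from typing import Dict, List


def pick_victims(
    groups: Dict[str, List[str]],
    probability: float,
    rng: random.Random,
) -> Dict[str, str]:
    """Choose at most one random instance per group to terminate.

    Each non-empty group is selected with the given probability; the
    caller is responsible for actually terminating the chosen instances.
    """
    victims: Dict[str, str] = {}
    for group, instances in sorted(groups.items()):
        if instances and rng.random() < probability:
            victims[group] = rng.choice(instances)
    return victims


fleet = {
    "api-usprod-v008": ["i-aaa", "i-bbb", "i-ccc"],
    "web-usprod-v003": ["i-ddd"],
    "empty-group-v001": [],
}
chosen = pick_victims(fleet, probability=1.0, rng=random.Random(42))
```

Passing the random generator in makes a run reproducible, which helps when reviewing which instances a given run would have picked.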
    • What’s going on?!
    • Atlas
    • Example Alert Config
      {
        "clusters": [ "epic_aggregator", "epic_aggregator-dev" ],
        "alerts": [
          // you can use javascript style comments in the config
          {
            "metricName": "EpicPlugin_NumDropped",
            "applyTo": "cluster",
            "condition": {
              "type": "StaticThreshold",
              "max": 0.0
            },
            "severity": "major",
            "description": "plugin is dropping metrics"
          },
          {
            "metricName": "EpicPlugin_NumDropped_Instance",
            "applyTo": "instance",
            "condition": {
              "type": "NumOccurrences",
              "num": 4,
              "condition": {
                "type": "StaticThreshold",
                "max": 0.0
              }
            },
            "overrides": {
              "service_key_override": "12345",
              "require_instance_status_not_in": ["DOWN", "OUT_OF_SERVICE"],
              "email_override": "devnull@netflix.com"
            },
            "severity": "minor"
          },
          {
            "metricName": "EpicPlugin_MetricCount",
            "applyTo": "instance",
            "description": "${instanceId} is reporting too many metrics",
            "condition": {
              "type": "NumOccurrences",
              "num": 4,
              "condition": {
                "type": "StaticThreshold",
                "max": 0.0
              }
            },
            "additionalDetails": {
              "statusUrl": "http://${publicDnsName}:7001/Status",
              "nacClusterUrl": "nac${env}/${region}/cluster/show/${cluster}"
            },
            "overrides": {
              "subject": "${instanceId} is reporting too many metrics",
              "incident_key": "${metricName}:${instanceId}",
              "service_key_override": "12345",
              "email_override": "devnull@netflix.com"
            },
            "severity": "minor"
          }
        ]
      }
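The nested "condition" objects in the example config can be read as a small expression tree. The sketch below evaluates such a tree against a window of recent metric values. The semantics are assumptions, since the slides do not define them: here "StaticThreshold" is violated when the latest value exceeds "max", and "NumOccurrences" fires when its inner condition is violated by at least "num" values in the window. The function name is hypothetical.

```python
from typing import Any, Mapping, Sequence


def alert_fires(condition: Mapping[str, Any], window: Sequence[float]) -> bool:
    """Evaluate a condition tree against recent metric values (assumed semantics)."""
    kind = condition["type"]
    if kind == "StaticThreshold":
        # Assumption: violated when the most recent value exceeds "max".
        return bool(window) and window[-1] > condition["max"]
    if kind == "NumOccurrences":
        # Assumption: count the values that individually violate the inner condition.
        inner = condition["condition"]
        hits = sum(1 for value in window if alert_fires(inner, [value]))
        return hits >= condition["num"]
    raise ValueError("unknown condition type: " + str(kind))


dropped_on_instance = {
    "type": "NumOccurrences",
    "num": 4,
    "condition": {"type": "StaticThreshold", "max": 0.0},
}
noisy = alert_fires(dropped_on_instance, [0.0, 1.0, 2.0, 0.0, 3.0, 5.0])
quiet = alert_fires(dropped_on_instance, [0.0, 1.0, 0.0, 0.0, 2.0])
```

Under these assumptions, wrapping a threshold in "NumOccurrences" keeps a single bad datapoint from paging anyone.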
    • Alert Tuning
    • Alert Systems (diagram; labels: CORE Event Gateway, Paging Service, Atlas alerting, AppDynamics, Amazon SES, CORE Agent, api, Other Team’s Agent)
    • Chronos
    • Data Collection Pipeline, Data Processing Pipeline (diagram). For more info, see BDT303: Data Science with Elastic MapReduce
    • Chukwa/Honu messages / min (chart): 63 billion messages a day
    • Best Practices
    • Incident Reviews: ask the key questions
      • What went wrong?
      • How could we have detected it sooner?
      • How could we have prevented it?
      • How can we prevent this class of problem in the future?
      • How can we improve our behavior for next time?
    • Best Practices for Data
      • Have multiple copies of all data
      • Keep those copies in multiple AZs
      • Avoid keeping state on a single instance
      • Take frequent snapshots of EBS disks
      • No secret keys on the instance
    • Netflix autoscaling (chart; annotations: 1 Traffic Peak, 2 Deployment)
    • AWS Usage (dollar amounts have been carefully removed)
    • Going multi-zone
    • Benefits of Amazon’s Zones
      • Loosely connected
      • Low latency between zones
      • 99.95% uptime guarantee per region
    • Going Multi-region
    • Leveraging Multi-region
      • 100% uptime is theoretically possible.
      • You have to replicate your data
      • This will cost money
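A little arithmetic shows why adding regions pushes availability toward 100%. The sketch treats the 99.95% per-region figure quoted earlier in the deck as an actual availability, and assumes regions fail independently and that failover between them is instant; none of these assumptions come from the slides, and real systems will do worse on each.

```python
HOURS_PER_YEAR = 365 * 24


def combined_availability(per_region: float, regions: int) -> float:
    """Availability when the service is up if at least one region is up.

    Assumes independent region failures and instant, perfect failover.
    """
    return 1.0 - (1.0 - per_region) ** regions


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year at the given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


one_region = downtime_hours_per_year(0.9995)
two_regions = downtime_hours_per_year(combined_availability(0.9995, 2))
```

Under these assumptions a single region allows about 4.4 hours of downtime a year, while two regions bring the figure down to a few seconds, which is the benefit being paid for with the extra replication cost.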
    • Circuit Breakers (Hystrix): Be liberal in what you accept, strict in what you send
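For readers unfamiliar with the circuit breaker pattern, a minimal sketch follows. It is not the Hystrix API. The class, its parameters (`failure_threshold`, `reset_timeout`) and the injectable `clock` are assumptions for illustration. After a number of consecutive failures the breaker "opens" and calls go straight to the fallback; once the timeout passes, one trial call is let through.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Hypothetical circuit breaker: fail fast while a dependency is unhealthy."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def is_open(self) -> bool:
        """True while calls should be short-circuited to the fallback."""
        if self._opened_at is None:
            return False
        # After the timeout, report closed so one trial call can go through.
        return self._clock() - self._opened_at < self._reset_timeout

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open():
            return fallback()
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            return fallback()
        self._failures = 0
        self._opened_at = None
        return result
```

The fallback keeps the caller responsive while the dependency recovers, instead of letting every request wait on a service that is already failing.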
    • Just a quick reminder...
      • (Some of) Netflix is open source: https://github.com/netflix
    • We are sincerely eager to hear your feedback on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance.
    • Questions?
    • Getting in touch
      • Email: jedberg@{gmail,netflix}.com
      • Twitter: @jedberg
      • Web: www.jedberg.net
      • Facebook: facebook.com/jedberg
      • LinkedIn: www.linkedin.com/in/jedberg