
High Availability Architecture and NetflixOSS


Slides presented at JPL February 2013. An updated version of the Re:Invent slides with NetflixOSS description. Black background for a change to help projector contrast.


  1. Highly Available Architecture at Netflix. JPL, February 2013. Adrian Cockcroft @adrianco #netflixcloud @NetflixOSS
  2. Netflix Inc. Netflix is the world’s leading Internet television network, with more than 33 million members in 40 countries enjoying more than one billion hours of TV shows and movies per month, including original series. Source:
  3. Abstract
     • Highly Available Architecture
     • Taxonomy of Outage Failure Modes
     • Real World Effects and Mitigation
     • Architecture Components from @NetflixOSS
  4. Blah Blah Blah (I’m skipping all the cloud intro etc. Netflix runs in the cloud; if you hadn’t figured that out already you aren’t paying attention and should read…)
  5. Things we don’t do
  6. Things We Do Do… In production at Netflix:
     • Big Data/Hadoop 2009
     • AWS Cloud 2009
     • Application Performance Management 2010
     • Integrated DevOps Practices 2010
     • Continuous Integration/Delivery 2010
     • NoSQL, Globally Distributed 2010
     • Platform as a Service; Micro-Services 2010
     • Social coding, open development/github 2011
  7. How Netflix Streaming Works (diagram): customer devices (PC, PS3, TV…) browse the web site or discovery API and play via the streaming API in the AWS cloud, which hosts user data, personalization, DRM, QoS logging, CDN management and steering, and content encoding; OpenConnect CDN boxes at CDN edge locations serve the Watch traffic.
  8. Web Server Dependencies Flow (home page business transaction as seen by AppDynamics): starting at the web service, the flow fans out to the personalization movie group chooser, Cassandra, memcached, and an S3 bucket. Each icon is three to a few hundred instances across three AWS zones.
  9. Component Micro-Services – Test with Chaos Monkey, Latency Monkey
  10. Three Balanced Availability Zones – Test with Chaos Gorilla (diagram): load balancers in front of Zones A, B and C, each running Cassandra and Evcache replicas.
  11. Triple Replicated Persistence – Cassandra maintenance affects individual replicas (diagram): load balancers in front of Zones A, B and C, each with Cassandra and Evcache replicas.
  12. Isolated Regions (diagram): US-East and EU-West each have their own load balancers and three zones (A, B, C) of Cassandra replicas.
  13. Failure Modes and Effects
     Failure Mode        | Probability | Current Mitigation Plan
     Application Failure | High        | Automatic degraded response
     AWS Region Failure  | Low         | Wait for region to recover
     AWS Zone Failure    | Medium      | Continue to run on 2 out of 3 zones
     Datacenter Failure  | Medium      | Migrate more functions to cloud
     Data store failure  | Low         | Restore from S3 backups
     S3 failure          | Low         | Restore from remote archive
     Until we got really good at mitigating high and medium probability failures, the ROI for mitigating regional failures didn’t make sense. Getting there…
  14. Application Resilience – Run what you wrote. Rapid detection. Rapid response.
  15. Run What You Wrote
     • Make developers responsible for failures
       – Then they learn and write code that doesn’t fail
     • Use incident reviews to find gaps to fix
       – Make sure it’s not about finding “who to blame”
     • Keep timeouts short, fail fast
       – Don’t let cascading timeouts stack up
     • Dynamic configuration options – Archaius
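The “keep timeouts short, fail fast” bullet can be sketched in plain Java. This is illustrative only; the class, method names, and timeout values are invented here, and Netflix’s real implementation lives in its platform libraries:

```java
import java.util.concurrent.*;

// Illustrative sketch of "fail fast with a short timeout": the service
// call gets a hard deadline, and on timeout we return a degraded
// fallback response rather than letting callers queue up behind us.
public class FailFast {
    private static final ExecutorService POOL = Executors.newFixedThreadPool(4);

    public static String callWithTimeout(Callable<String> call, long timeoutMs, String fallback) {
        Future<String> future = POOL.submit(call);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | InterruptedException | ExecutionException e) {
            future.cancel(true);   // don't leave the slow call running
            return fallback;       // degraded response, returned fast
        }
    }

    public static void main(String[] args) {
        // A fast call succeeds; a slow call is cut off and degraded.
        String ok = callWithTimeout(() -> "personalized row", 200, "default row");
        String degraded = callWithTimeout(() -> {
            Thread.sleep(5000);
            return "personalized row";
        }, 50, "default row");
        System.out.println(ok + " / " + degraded);
        POOL.shutdownNow();
    }
}
```

The point of the short deadline is the cascading-timeout bullet: if a caller’s own timeout is longer than the sum of its dependencies’ timeouts, failures stack instead of degrading.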
  16. Resilient Design – Hystrix, RxJava
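The Hystrix pattern named on this slide, wrapping each dependency call in a command with a fallback behind a circuit breaker, can be reduced to a standalone sketch. This is not the real Hystrix API (which adds thread isolation, timeouts, a half-open retry window, and metrics); MiniCommand and its failure threshold are invented for illustration:

```java
// Simplified sketch of the Hystrix command shape: run() does the real
// work, getFallback() supplies the degraded answer, and the breaker
// short-circuits to the fallback after repeated failures. In this toy
// version an open circuit never retries; real Hystrix probes again
// after a sleep window.
public abstract class MiniCommand<T> {
    private static final int FAILURE_THRESHOLD = 3;
    private int consecutiveFailures = 0;

    protected abstract T run() throws Exception;   // the protected dependency call
    protected abstract T getFallback();            // degraded response

    public T execute() {
        if (consecutiveFailures >= FAILURE_THRESHOLD) {
            return getFallback();                  // circuit open: fail fast
        }
        try {
            T result = run();
            consecutiveFailures = 0;               // success closes the circuit
            return result;
        } catch (Exception e) {
            consecutiveFailures++;
            return getFallback();
        }
    }
}
```

A service client subclasses the command per dependency, so a dead downstream burns at most a few failed calls before every caller gets the fast fallback.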
  17. Chaos Monkey
     • Computers (datacenter or AWS) randomly die
       – Fact of life, but too infrequent to test resiliency
     • Test to make sure systems are resilient
       – Kill individual instances without customer impact
     • Latency Monkey (coming soon)
       – Inject extra latency and error return codes
  18. Edda – Configuration History (diagram): Eureka services metadata, AWS state (instances, ASGs, etc.) and AppDynamics request flow data feed into Edda; the Monkeys query it.
  19. Edda Query Examples
     Find any instances that have ever had a specific public IP address:
     $ curl "http://edda/api/v2/view/instances;publicIpAddress=;_since=0"
     ["i-0123456789","i-012345678a","i-012345678b"]
     Show the most recent change to a security group:
     $ curl "http://edda/api/v2/aws/securityGroups/sg-0123456789;_diff;_all;_limit=2"
     --- /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351040779810
     +++ /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351044093504
     @@ -1,33 +1,33 @@
      {… "ipRanges" : [ "", "",
     + "",
     - ""
      … }
  20. Distributed Operational Model
     • Developers
       – Provision and run their own code in production
       – Take turns to be on call if it breaks (pagerduty)
       – Configure autoscalers to handle capacity needs
     • DevOps and PaaS (aka NoOps)
       – DevOps is used to build and run the PaaS
       – PaaS constrains Dev to use automation instead
       – PaaS puts more responsibility on Dev, with tools
  21. Rapid Rollback
     • Use a new Autoscale Group to push code
     • Leave existing ASG in place, switch traffic
     • If OK, auto-delete old ASG a few hours later
     • If “whoops”, switch traffic back in seconds
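The push-and-rollback steps on this slide can be modeled as a toy class. RedBlackPush and its method names are invented for illustration; Asgard does the real work against AWS autoscale groups and load balancers:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of the rapid-rollback push pattern: a new autoscale group
// takes traffic while the old one stays in place, so a bad push can be
// reverted by flipping traffic back in seconds. Illustrative only.
public class RedBlackPush {
    private String liveGroup;                        // ASG currently taking traffic
    private final Deque<String> standby = new ArrayDeque<>();

    public RedBlackPush(String initialGroup) { this.liveGroup = initialGroup; }

    public void push(String newGroup) {
        standby.push(liveGroup);                     // leave existing ASG in place
        liveGroup = newGroup;                        // switch traffic to new code
    }

    public void rollback() {
        if (!standby.isEmpty()) liveGroup = standby.pop();  // "whoops": switch back
    }

    public void confirm() { standby.clear(); }       // OK: old ASGs can be deleted

    public String live() { return liveGroup; }
}
```

The key design point is that rollback never rebuilds anything; the old, known-good group is still running, so the switch is only a traffic change.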
  22. Asgard
  23. Platform Outage Taxonomy – Classify and name the different types of things that can go wrong
  24. YOLO
  25. Zone Failure Modes
     • Power outage
       – Instances lost, ephemeral state lost
       – Clean break and recovery, fail fast, “no route to host”
     • Network outage
       – Instances isolated, state inconsistent
       – More complex symptoms, recovery issues, transients
     • Dependent service outage
       – Cascading failures, misbehaving instances, human errors
       – Confusing symptoms, recovery issues, byzantine effects
  26. Zone Power Failure
     • June 29, 2012 AWS US-East - The Big Storm
     • Highlights
       – One of 10+ US-East datacenters failed generator startup
       – UPS depleted -> 10 min power outage for 7% of instances
     • Result
       – Netflix lost power to most of a zone, evacuated the zone
       – Small/brief user impact due to errors and retries
  27. Zone Failure Modes (diagram): a zone network outage, a zone power outage, and a zone dependent service outage shown against US-East and EU-West regions, each with three zones of Cassandra replicas behind load balancers.
  28. Regional Failure Modes
     • Network failure takes region offline
       – DNS configuration errors
       – Bugs and configuration errors in routers
       – Network capacity overload
     • Control plane overload affecting entire region
       – Consequence of other outages
       – Lose control of remaining zones’ infrastructure
       – Cascading service failure, hard to diagnose
  29. Regional Control Plane Overload
     • April 2011 – “The big EBS Outage”
       – Human error during network upgrade triggered cascading failure
       – Zone level failure, with brief regional control plane overload
     • Netflix infrastructure impact
       – Instances in one zone hung and could not launch replacements
       – Overload prevented other zones from launching instances
       – Some MySQL slaves offline for a few days
     • Netflix customer visible impact
       – Higher latencies for a short time
       – Higher error rates for a short time
       – Outage was at a low traffic level time, so no capacity issues
  30. Regional Failure Modes (diagram): a regional network outage or control plane overload takes out all of US-East while EU-West keeps running.
  31. Dependent Services Failure
     • June 29, 2012 AWS US-East - The Big Storm
       – Power failure recovery overloaded EBS storage service
       – Backlog of instance startups using EBS root volumes
     • ELB (Load Balancer) impacted
       – ELB instances couldn’t scale because EBS was backlogged
       – ELB control plane also became backlogged
     • Mitigation plans mentioned
       – Multiple control plane request queues to isolate backlog
       – Rapid DNS based traffic shifting between zones
  32. Application Routing Failure
     June 29, 2012 AWS US-East - The Big Storm: the Eureka service directory failed to mark down dead instances due to a configuration error, so applications using zone-aware routing kept trying to talk to dead instances and timing out.
     • Effect: higher latency and errors
     • Mitigation: fixed the config, and made not using zone-aware routing the default
  33. Dec 24th 2012 Partial Regional ELB Outage
     • ELB (Load Balancer) impacted
       – ELB control plane database state accidentally corrupted
       – Hours to detect, hours to restore from backups
     • Mitigation plans mentioned
       – Tighter process for access to control plane
       – Better zone isolation
  34. Global Failure Modes
     • Software bugs
       – Externally triggered (e.g. leap year/leap second)
       – Memory leaks and other delayed action failures
     • Global configuration errors
       – Usually human error
       – Both infrastructure and application level
     • Cascading capacity overload
       – Customers migrating away from a failure
       – Lack of cross region service isolation
  35. Global Software Bug Outages
     • AWS S3 global outage in 2008
       – Gossip protocol propagated errors worldwide
       – No data loss, but service offline for up to 9hrs
       – Extra error detection fixes, no big issues since
     • Microsoft Azure leap day outage in 2012
       – Bug failed to generate certificates ending 2/29/13
       – Failure to launch new instances for up to 13hrs
       – One line code fix
     • Netflix configuration error in 2012
       – Global property updated to broken value
       – Streaming stopped worldwide for ~1hr until we changed it back
       – Fix planned to keep history of properties for quick rollback
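The last mitigation above, keeping a history of property values for quick rollback, could be sketched like this. PropertyHistory is a hypothetical illustration, not the actual Archaius implementation:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Sketch of the mitigation on the slide: keep every value a dynamic
// property has held, so a broken global update can be rolled back to
// the previous known-good value in one step. Illustrative only.
public class PropertyHistory {
    private final Map<String, Deque<String>> history = new HashMap<>();

    public void set(String name, String value) {
        history.computeIfAbsent(name, k -> new ArrayDeque<>()).push(value);
    }

    public String get(String name) {
        Deque<String> values = history.get(name);
        return (values == null || values.isEmpty()) ? null : values.peek();
    }

    public String rollback(String name) {
        Deque<String> values = history.get(name);
        if (values != null && values.size() > 1) values.pop();  // discard broken value
        return get(name);
    }
}
```

With history, recovering from a bad global property is a rollback rather than a hunt for what the old value used to be.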
  36. Global Failure Modes (diagram): cascading capacity overload as demand migrates from a failed region to the other, plus software bugs and global configuration errors that hit both regions at once (“Oops…”).
  37. Managing Multi-Region Availability (diagram): AWS Route53, DynECT and UltraDNS sit above the regional load balancers and their three zones of Cassandra replicas. What we need is a portable way to manage multiple DNS providers…
  38. Denominator – “The next version is more portable…” for DNS
     • Use cases: Edda, multi-region failover
     • A common model (Denominator) with DNS vendor plug-ins: AWS Route53 (IAM key auth, REST), DynECT (user/pwd, REST), UltraDNS (user/pwd, SOAP), etc. The vendor API models are varied and mostly broken.
     • Currently being built by Adrian Cole (the jclouds guy, he works for Netflix now…)
  39. Highly Available Storage – A highly scalable, available and durable deployment pattern
  40. Micro-Service Pattern (diagram): many different single-function REST clients call a stateless data access REST service built on the Astyanax Cassandra client, backed by a single-function Cassandra cluster managed by Priam (between 6 and 72 nodes; one keyspace replaces a single table or materialized view), with an optional datacenter update flow. Each icon represents a horizontally scaled service of three to hundreds of instances deployed over three availability zones. AppDynamics provides service flow visualization.
  41. Stateless Micro-Service Architecture: Linux base AMI (CentOS or Ubuntu); optional Apache frontend, memcached, and non-Java apps; Java (JDK 6 or 7) running Tomcat with the application war file, base servlet, platform and client interface jars, and Astyanax; monitoring via the AppDynamics appagent and machineagent; healthcheck and status servlets, JMX interface, Servo for autoscale; GC and thread dump logging, log rotation to S3, Epic/Atlas.
  42. Astyanax – Available at
     • Features
       – Complete abstraction of connection pool from RPC protocol
       – Fluent style API
       – Operation retry with backoff
       – Token aware
     • Recipes
       – Distributed row lock (without zookeeper)
       – Multi-DC row lock
       – Uniqueness constraint
       – Multi-row uniqueness constraint
       – Chunked and multi-threaded large file storage
  43. Astyanax Query Example – Paginate through all columns in a row:

     ColumnList<String> columns;
     int pageSize = 10;
     try {
         RowQuery<String, String> query = keyspace
             .prepareQuery(CF_STANDARD1)
             .getKey("A")
             .setIsPaginating()
             .withColumnRange(new RangeBuilder().setMaxSize(pageSize).build());
         while (!(columns = query.execute().getResult()).isEmpty()) {
             for (Column<String> c : columns) {
                 // process each column in this page
             }
         }
     } catch (ConnectionException e) {
         // handle connection pool / RPC failure
     }
  44. Astyanax - Cassandra Write Data Flows. Single region, multiple availability zones, token aware:
     1. Client writes to the local Cassandra coordinator
     2. Coordinator writes to two other zones
     3. Nodes return acks
     4. Data written to internal commit log disks (no more than 10 seconds later)
     If a node goes offline, hinted handoff completes the write when the node comes back up. Token aware clients can choose to wait for one node, a quorum, or all nodes to ack the write. SSTable disk writes and compactions occur asynchronously.
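The “one node, a quorum, or all nodes” choice in the write flow above amounts to comparing replica acks against a threshold before unblocking the client. A minimal sketch (AckCounter is illustrative, not Cassandra’s code):

```java
// Sketch of the consistency-level choice in the write flow: the
// coordinator sends the write to all replicas but considers it
// complete as soon as the chosen number of acks (ONE, QUORUM, or ALL)
// have arrived. Illustrative only.
public class AckCounter {
    public enum Level { ONE, QUORUM, ALL }

    private final int replicas;
    private int acks = 0;

    public AckCounter(int replicas) { this.replicas = replicas; }

    public void ack() { acks++; }

    public int required(Level level) {
        switch (level) {
            case ONE:    return 1;
            case QUORUM: return replicas / 2 + 1;  // e.g. 2 of 3 zones
            default:     return replicas;          // ALL
        }
    }

    public boolean satisfied(Level level) { return acks >= required(level); }
}
```

With three zone replicas, QUORUM means two acks, which is why the architecture can keep taking writes when one of three zones is down.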
  45. Data Flows for Multi-Region Writes. Token aware, consistency level = local quorum:
     1. Client writes to local replicas
     2. Local write acks returned to client, which continues when 2 of 3 local nodes are committed
     3. Local coordinator writes to remote coordinator (100+ ms latency)
     4. When data arrives, remote coordinator node acks and copies to other remote zones
     5. Remote nodes ack to local coordinator
     6. Data flushed to internal commit log disks (no more than 10 seconds later)
     If a node or region goes offline, hinted handoff completes the write when the node comes back up. Nightly global compare and repair jobs ensure everything stays consistent.
  46. Cassandra Instance Architecture: Linux base AMI (CentOS or Ubuntu); Tomcat and Priam on JDK (Java 7) alongside the Cassandra server; healthcheck and status; monitoring via the AppDynamics appagent and machineagent; GC and thread dump logging, Epic/Atlas; local ephemeral disk space (2TB of SSD or 1.6TB disk) holding the commit log and SSTables.
  47. Priam – Cassandra Automation. Available at
     • Netflix platform Tomcat code
     • Zero touch auto-configuration
     • State management for Cassandra JVM
     • Token allocation and assignment
     • Broken node auto-replacement
     • Full and incremental backup to S3
     • Restore sequencing from S3
     • Grow/shrink Cassandra “ring”
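Token allocation for a balanced ring amounts to spacing initial tokens evenly around the partitioner’s range. A simplified sketch, assuming a RandomPartitioner-style 0..2^127 range (TokenAllocator is illustrative; Priam’s real allocator also handles per-region offsets and replacement nodes):

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of balanced token assignment for a Cassandra ring:
// with n nodes, node i gets token i * 2^127 / n so the ring is evenly
// divided. Illustrative only; region offsets are omitted.
public class TokenAllocator {
    private static final BigInteger RANGE = BigInteger.valueOf(2).pow(127);

    public static List<BigInteger> tokensFor(int nodeCount) {
        List<BigInteger> tokens = new ArrayList<>();
        BigInteger n = BigInteger.valueOf(nodeCount);
        for (int i = 0; i < nodeCount; i++) {
            // Evenly spaced point on the ring for node i
            tokens.add(RANGE.multiply(BigInteger.valueOf(i)).divide(n));
        }
        return tokens;
    }
}
```

Even spacing is what makes grow/shrink operations predictable: doubling the ring interleaves the new tokens halfway between the old ones.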
  48. ETL for Cassandra
     • Data is de-normalized over many clusters!
     • Too many to restore from backups for ETL
     • Solution: read backup files using Hadoop
     • Aegisthus
       – High throughput raw SSTable processing
       – Re-normalizes many clusters to a consistent view
       – Extract, Transform, then Load into Teradata
  49. Build Your Own Highly Available Platform @NetflixOSS
  50. Assembling the Puzzle
  51. 2013 Roadmap highlights
     • Build and Deploy: Bakeries, workflow orchestration, push-button launcher
     • Recipes: sample applications, Karyon base server
     • Availability: more Monkeys, error and latency injection, Denominator (more later…)
     • Analytics: Atlas monitoring, Genie Hadoop PaaS, Explorers / visualization
     • Persistence: EvCache persister and memcached service, more Astyanax recipes
  52. Recipes & Launcher
  53. Three Questions – Why is Netflix doing this? How does it all fit together? What is coming next?
  54. Netflix Deconstructed – Content as a Service on a Platform. The exclusive, extensive content and the easy to use, personalized, low cost, global service are long term strategic barriers to competition. The agile, reliable, scalable and secure platform enables the business, but doesn’t differentiate Netflix against large competitors.
  55. Platform Evolution – 2009-2010 bleeding edge innovation, 2011-2012 common pattern, 2013-2014 shared pattern. Netflix ended up several years ahead of the industry, but it’s not a sustainable position.
  56. Making it easy to follow – Exploring the wild west each time vs. laying down a shared route
  57. Goals: establish our solutions as best practices / standards; hire, retain and engage top engineers; build up the Netflix technology brand; benefit from a shared ecosystem.
  58. Progress during 2012 – From pushing the platform uphill to runaway success
  59. How does it all fit together?
  60. Our Current Catalog of Releases – Free code available at
  61. Open Source Projects (legend: Github / Techblog, Apache Contributions, Techblog Post, Coming Soon)
     • Priam – Cassandra as a Service
     • Exhibitor – Zookeeper as a Service
     • Servo and Autoscaling Scripts
     • Astyanax – Cassandra client for Java
     • Curator – Zookeeper patterns
     • Genie – Hadoop PaaS
     • CassJMeter – Cassandra test suite
     • EVCache – Memcached as a Service
     • Hystrix – robust service pattern
     • Cassandra – multi-region EC2 datastore support
     • Eureka / Discovery – service directory
     • RxJava – reactive patterns
     • Asgard – AutoScaleGroup based AWS console
     • Aegisthus – Hadoop ETL for Cassandra
     • Archaius – dynamic properties service
     • Edda – config state with history
     • Chaos Monkey – robustness verification
     • Explorers
     • Governator – library lifecycle and dependency injection
     • Latency Monkey
     • Denominator (announced today)
     • Odin – orchestration for Asgard
     • Ribbon – REST client + mid-tier LB
     • Janitor Monkey
     • Blitz4j – async logging
     • Karyon – instrumented REST base server
     • Bakeries and AMI
  62. NetflixOSS Continuous Build and Deployment (diagram): NetflixOSS source on Github builds through Jenkins (with Dynaslave build slaves) into Maven Central; the AWS Bakery combines builds with the AWS base AMI into baked AMIs; Asgard (+ Frigga) deploys them into the AWS account, with Odin orchestrating and an API console alongside.
  63. NetflixOSS Services Scope (diagram): within an AWS account sit the Asgard console, Archaius config service, cross-region Priam C*, Eureka registry, Explorers dashboards, Exhibitor ZK, and Genie Hadoop services; across multiple AWS regions and three AWS zones run the application clusters in autoscale groups, Priam Cassandra (persistent storage), Evcache memcached (ephemeral storage), Atlas monitoring, Edda history, and the Simian Army against instances.
  64. NetflixOSS Instance Libraries
     • Initialization
       – Baked AMI – Tomcat, Apache, your code
       – Governator – Guice based dependency injection
       – Archaius – dynamic configuration properties client
       – Eureka – service registration client
     • Service Requests
       – Karyon – base server for inbound requests
       – RxJava – reactive pattern
       – Hystrix/Turbine – dependencies and real-time status
       – Ribbon – REST client for outbound calls
     • Data Access
       – Astyanax – Cassandra client and pattern library
       – Evcache – zone aware Memcached client
       – Curator – Zookeeper patterns
     • Logging
       – Blitz4j – non-blocking logging
       – Servo – metrics export for autoscaling
       – Atlas – high volume instrumentation
  65. NetflixOSS Testing and Automation
     • Test Tools
       – CassJmeter – load testing for C*
       – Circus Monkey – test rebalancing
     • Maintenance
       – Janitor Monkey
       – Efficiency Monkey
       – Doctor Monkey
       – Howler Monkey
     • Availability
       – Chaos Monkey – instances
       – Chaos Gorilla – availability zones
       – Chaos Kong – regions
       – Latency Monkey – latency and error injection
     • Security
       – Security Monkey
       – Conformity Monkey
  66. What’s Coming Next? More features: better portability, higher availability, easier to deploy. More use cases: contributions from end users and from vendors.
  67. Functionality and scale now, portability coming. Moving from parts to a platform in 2013. Netflix is fostering an ecosystem. Rapid evolution: low MTBIAMSH (Mean Time Between Idea And Making Stuff Happen).
  68. Takeaway – Netflix has built and deployed a scalable, global, highly available Platform as a Service. We encourage you to adopt or extend the NetflixOSS platform ecosystem. @adrianco #netflixcloud @NetflixOSS