
Gluecon 2013 - Netflix Cloud Native Tutorial Details (part 2)


A collection of information taken from previous presentations that was used as drill down for supporting discussion of specific topics during the tutorial.



  1. Details: Building Using The NetflixOSS Architecture. May 2013. Adrian Cockcroft, @adrianco #netflixcloud @NetflixOSS
  2. Architectures for High Availability. Cassandra Storage and Replication. NetflixOSS Components.
  3. Component Micro-Services. Test with Chaos Monkey and Latency Monkey.
  4. Three Balanced Availability Zones. Test with Chaos Gorilla. (Diagram: load balancers in front of Cassandra and Evcache replicas in Zones A, B, and C.)
  5. Triple Replicated Persistence. Cassandra maintenance affects individual replicas. (Diagram: Cassandra and Evcache replicas in Zones A, B, and C behind load balancers.)
  6. Isolated Regions. (Diagram: US-East and EU-West, each with its own load balancers and Cassandra replicas in Zones A, B, and C.)
  7. Failure Modes and Effects
     – Application Failure: High probability. Mitigation: automatic degraded response.
     – AWS Region Failure: Low. Mitigation: Active-Active using Denominator.
     – AWS Zone Failure: Medium. Mitigation: continue to run on 2 out of 3 zones.
     – Datacenter Failure: Medium. Mitigation: migrate more functions to cloud.
     – Data store failure: Low. Mitigation: restore from S3 backups.
     – S3 failure: Low. Mitigation: restore from remote archive.
     Until we got really good at mitigating high and medium probability failures, the ROI for mitigating regional failures didn't make sense. Working on Active-Active in 2013.
  8. Application Resilience. Run what you wrote. Rapid detection. Rapid response.
  9. Run What You Wrote
     – Make developers responsible for failures: then they learn and write code that doesn't fail.
     – Use Incident Reviews to find gaps to fix; make sure it's not about finding "who to blame".
     – Keep timeouts short, fail fast; don't let cascading timeouts stack up.
     – Dynamic configuration options: Archaius.
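The "keep timeouts short, fail fast" advice above can be sketched as a small wrapper that puts a hard deadline on a dependency call. This is a generic illustration, not Netflix platform code; the class and method names are hypothetical, and in the real platform such timeouts would be made dynamically tunable via Archaius.

```java
import java.util.concurrent.*;

public class FailFast {
    // Daemon threads so the pool never keeps the JVM alive on its own.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);
        return t;
    });

    // Run `call`, but give up after `timeoutMs` and return `fallback`
    // instead of letting slow calls stack up into cascading timeouts.
    public static <T> T callWithTimeout(Callable<T> call, long timeoutMs, T fallback) {
        Future<T> f = POOL.submit(call);
        try {
            return f.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | InterruptedException | ExecutionException e) {
            f.cancel(true);   // don't let the slow call linger
            return fallback;  // degraded response instead of a hang
        }
    }

    public static void main(String[] args) {
        // Fast call succeeds; slow call fails fast and falls back.
        String ok = callWithTimeout(() -> "real", 1000, "fallback");
        String slow = callWithTimeout(() -> { Thread.sleep(5000); return "real"; }, 50, "fallback");
        System.out.println(ok + " " + slow);
    }
}
```

The fallback path is what makes the "automatic degraded response" mitigation from the failure-modes table possible.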
  10. Resilient Design – Hystrix, RxJava
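Hystrix wraps dependency calls in commands that have fallbacks and a circuit breaker. As a rough illustration of the pattern only (this is not the Hystrix API), a breaker that opens after a run of consecutive failures and then sheds load looks like:

```java
import java.util.concurrent.Callable;

public class Breaker<T> {
    private final int threshold;
    private int consecutiveFailures = 0;

    public Breaker(int threshold) { this.threshold = threshold; }

    public boolean isOpen() { return consecutiveFailures >= threshold; }

    public T call(Callable<T> dependency, T fallback) {
        if (isOpen()) return fallback;   // open: fail fast, don't touch the dependency
        try {
            T result = dependency.call();
            consecutiveFailures = 0;     // success closes the breaker
            return result;
        } catch (Exception e) {
            consecutiveFailures++;       // count failures toward opening
            return fallback;
        }
    }

    public static void main(String[] args) {
        Breaker<String> b = new Breaker<>(3);
        for (int i = 0; i < 3; i++) b.call(() -> { throw new RuntimeException("down"); }, "fallback");
        System.out.println(b.isOpen());  // breaker is now open
    }
}
```

Real Hystrix adds thread-pool isolation, rolling failure-rate windows, and half-open probing on top of this basic idea.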
  11. Chaos Monkey
     – Computers (datacenter or AWS) randomly die: a fact of life, but too infrequent to test resiliency.
     – Test to make sure systems are resilient: kill individual instances without customer impact.
     – Latency Monkey (coming soon): inject extra latency and error return codes.
  12. Monkeys. (Diagram: Edda for configuration history, ASGs, etc.; Eureka for services metadata; AppDynamics for request flow.)
  13. Edda Query Examples
     Find any instances that have ever had a specific public IP address:
     $ curl "http://edda/api/v2/view/instances;publicIpAddress=;_since=0"
     ["i-0123456789","i-012345678a","i-012345678b"]
     Show the most recent change to a security group:
     $ curl "http://edda/api/v2/aws/securityGroups/sg-0123456789;_diff;_all;_limit=2"
     --- /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351040779810
     +++ /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351044093504
     @@ -1,33 +1,33 @@
     {
       …
       "ipRanges" : [
         "",
         "",
     +   "",
     -   ""
       …
     }
  14. Platform Outage Taxonomy. Classify and name the different types of things that can go wrong.
  15. YOLO
  16. Zone Failure Modes
     – Power Outage: instances lost, ephemeral state lost. Clean break and recovery, fail fast, "no route to host".
     – Network Outage: instances isolated, state inconsistent. More complex symptoms, recovery issues, transients.
     – Dependent Service Outage: cascading failures, misbehaving instances, human errors. Confusing symptoms, recovery issues, byzantine effects.
  17. Zone Power Failure
     – June 29, 2012, AWS US-East: The Big Storm.
     – Highlights: one of 10+ US-East datacenters failed generator startup; UPS depleted -> 10min power outage for 7% of instances.
     – Result: Netflix lost power to most of a zone and evacuated the zone; small/brief user impact due to errors and retries.
  18. Zone Failure Modes. (Diagram: US-East and EU-West load balancers and Cassandra replicas in Zones A, B, and C, annotated with zone power outage, zone network outage, and zone dependent service outage.)
  19. Regional Failure Modes
     – Network failure takes region offline: DNS configuration errors; bugs and configuration errors in routers; network capacity overload.
     – Control plane overload affecting an entire region: consequence of other outages; lose control of remaining zones' infrastructure; cascading service failure, hard to diagnose.
  20. Regional Control Plane Overload
     – April 2011, "The big EBS Outage": human error during a network upgrade triggered a cascading failure; zone-level failure, with brief regional control plane overload.
     – Netflix infrastructure impact: instances in one zone hung and could not launch replacements; overload prevented other zones from launching instances; some MySQL slaves offline for a few days.
     – Netflix customer-visible impact: higher latencies and error rates for a short time; the outage was at a low traffic level time, so no capacity issues.
  21. Regional Failure Modes. (Diagram: US-East and EU-West regions annotated with regional network outage and control plane overload.)
  22. Dependent Services Failure
     – June 29, 2012, AWS US-East, The Big Storm: power failure recovery overloaded the EBS storage service; backlog of instance startups using EBS root volumes.
     – ELB (load balancer) impacted: ELB instances couldn't scale because EBS was backlogged; the ELB control plane also became backlogged.
     – Mitigation plans mentioned: multiple control plane request queues to isolate the backlog; rapid DNS-based traffic shifting between zones.
  23. Application Routing Failure. June 29, 2012, AWS US-East, The Big Storm. (Diagram: zone power outage in US-East.) Applications not using zone-aware routing kept trying to talk to dead instances and timing out. The Eureka service directory failed to mark down dead instances due to a configuration error. Effect: higher latency and errors. Mitigation: fixed the config, and made zone-aware routing the default.
  24. Dec 24th 2012: Partial Regional ELB Outage
     – ELB (load balancer) impacted: ELB control plane database state accidentally corrupted; hours to detect, hours to restore from backups.
     – Mitigation plans mentioned: tighter process for access to the control plane; better zone isolation.
  25. Global Failure Modes
     – Software bugs: externally triggered (e.g. leap year/leap second); memory leaks and other delayed-action failures.
     – Global configuration errors: usually human error; both infrastructure and application level.
     – Cascading capacity overload: customers migrating away from a failure; lack of cross-region service isolation.
  26. Global Software Bug Outages
     – AWS S3 global outage in 2008: gossip protocol propagated errors worldwide; no data loss, but service offline for up to 9hrs; extra error detection fixes, no big issues since.
     – Microsoft Azure Leap Day outage in 2012: a bug failed to generate certificates ending 2/29/13; failure to launch new instances for up to 13hrs; one line code fix.
     – Netflix configuration error in 2012: a global property was updated to a broken value; streaming stopped worldwide for ~1hr until we changed it back; fix planned to keep a history of properties for quick rollback.
  27. Global Failure Modes. (Diagram: software bugs and global configuration errors hit both regions; cascading capacity overload as capacity demand migrates between US-East and EU-West. "Oops…")
  28. Denominator: Portable DNS Management. Use cases such as Edda and multi-region failover call Denominator's common model, which maps through DNS vendor plug-ins to each vendor's API model (varied and mostly broken): AWS Route53 (IAM key auth, REST), DynECT (user/pwd, REST), UltraDNS (user/pwd, SOAP), etc. Currently being built by Adrian Cole (the jclouds guy, he works for Netflix now…).
  29. Highly Available Storage. A highly scalable, available and durable deployment pattern.
  30. Micro-Service Pattern. One keyspace replaces a single table or materialized view. A single-function Cassandra cluster managed by Priam, between 6 and 72 nodes, sits behind a stateless data access REST service built on the Astyanax Cassandra client, with an optional datacenter update flow and many different single-function REST clients. (Diagram: AppDynamics service flow visualization; each icon represents a horizontally scaled service of three to hundreds of instances deployed over three availability zones.)
  31. Stateless Micro-Service Architecture. (Stack diagram: Linux base AMI (CentOS or Ubuntu); optional Apache frontend, memcached, non-java apps; monitoring via log rotation to S3, AppDynamics machine agent, Epic/Atlas; Java (JDK 6 or 7) with AppDynamics app agent monitoring, GC and thread dump logging; Tomcat running the application war file, base servlet, platform, client interface jars, Astyanax; healthcheck and status servlets, JMX interface, Servo autoscale.)
  32. Astyanax
     – Features: complete abstraction of connection pool from RPC protocol; fluent style API; operation retry with backoff; token aware.
     – Recipes: distributed row lock (without zookeeper); multi-DC row lock; uniqueness constraint; multi-row uniqueness constraint; chunked and multi-threaded large file storage.
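The "operation retry with backoff" feature can be pictured as a loop that retries a failing operation with exponentially growing pauses. This is a generic sketch of the technique, not the actual Astyanax RetryPolicy interface:

```java
import java.util.concurrent.Callable;

public class RetryBackoff {
    // Retry up to maxAttempts, sleeping baseDelayMs * 2^attempt between tries,
    // so a flaky node gets breathing room instead of a hammering retry storm.
    public static <T> T withRetry(Callable<T> op, int maxAttempts, long baseDelayMs) throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;
                Thread.sleep(baseDelayMs << attempt);  // exponential backoff
            }
        }
        throw last;  // every attempt failed: surface the last error
    }
}
```

Combined with short per-call timeouts, bounded backoff keeps transient Cassandra node failures from turning into cascading client failures.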
  33. Initializing Astyanax

     // Configuration either set in code or nfastyanax.properties
     platform.ListOfComponentsToInit=LOGGING,APPINFO,DISCOVERY
     netflix.environment=test
     default.astyanax.readConsistency=CL_QUORUM
     default.astyanax.writeConsistency=CL_QUORUM
     MyCluster.MyKeyspace.astyanax.servers=

     // Must initialize platform for discovery to work
     NFLibraryManager.initLibrary(PlatformManager.class, props, false, true);
     NFLibraryManager.initLibrary(NFAstyanaxManager.class, props, true, false);

     // Open a keyspace instance
     Keyspace keyspace = KeyspaceFactory.openKeyspace("MyCluster", "MyKeyspace");
  34. Astyanax Query Example. Paginate through all columns in a row:

     ColumnList<String> columns;
     int pageSize = 10;
     try {
         RowQuery<String, String> query = keyspace
             .prepareQuery(CF_STANDARD1)
             .getKey("A")
             .setIsPaginating()
             .withColumnRange(new RangeBuilder().setMaxSize(pageSize).build());
         while (!(columns = query.execute().getResult()).isEmpty()) {
             for (Column<String> c : columns) {
                 // process each column in the current page here
             }
         }
     } catch (ConnectionException e) {
         // handle Cassandra connection failure
     }
  35. Astyanax - Cassandra Write Data Flows. Single region, multiple availability zones, token aware.
     1. Client writes to local coordinator.
     2. Coordinator writes to other zones.
     3. Nodes return ack.
     4. Data written to internal commit log disks (no more than 10 seconds later).
     If a node goes offline, hinted handoff completes the write when the node comes back up. Requests can choose to wait for one node, a quorum, or all nodes to ack the write. SSTable disk writes and compactions occur asynchronously. (Diagram: token-aware clients writing to Cassandra nodes in Zones A, B, and C.)
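The write flow above lets a request wait for one, a quorum, or all replicas. With replication factor 3 a quorum is 2, so any quorum write and any quorum read must overlap in at least one replica (W + R > RF), which is why the CL_QUORUM settings in the Astyanax configuration give read-your-writes behavior. A quick arithmetic sketch (the class and method names are illustrative, not a Cassandra API):

```java
public class Quorum {
    // Cassandra's quorum: a majority of the replicas for a key.
    public static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // Reads see the latest write whenever write acks + read responses
    // exceed the replication factor, forcing at least one overlap.
    public static boolean stronglyConsistent(int writeNodes, int readNodes, int rf) {
        return writeNodes + readNodes > rf;
    }

    public static void main(String[] args) {
        int rf = 3;
        System.out.println(quorum(rf));                                   // 2
        System.out.println(stronglyConsistent(quorum(rf), quorum(rf), rf)); // true
        System.out.println(stronglyConsistent(1, 1, rf));                  // false: ONE+ONE can miss
    }
}
```

The same arithmetic explains the multi-region slide that follows: LOCAL_QUORUM waits for 2 of 3 nodes in the local region only, trading cross-region consistency for latency.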
  36. Data Flows for Multi-Region Writes. Token aware, consistency level = local quorum.
     1. Client writes to local replicas.
     2. Local write acks returned to the client, which continues when 2 of 3 local nodes are committed.
     3. Local coordinator writes to the remote coordinator.
     4. When data arrives, the remote coordinator node acks and copies to the other remote zones.
     5. Remote nodes ack to the local coordinator.
     6. Data flushed to internal commit log disks (no more than 10 seconds later).
     If a node or region goes offline, hinted handoff completes the write when the node comes back up. Nightly global compare and repair jobs ensure everything stays consistent. (Diagram: US and EU clients writing to Cassandra in Zones A, B, and C of each region, with 100+ms latency between regions.)
  37. Cassandra Instance Architecture. (Stack diagram: Linux base AMI (CentOS or Ubuntu); Tomcat and Priam on the JDK with healthcheck and status; monitoring via AppDynamics machine agent and Epic/Atlas; Java (JDK 7) with AppDynamics app agent monitoring, GC and thread dump logging; Cassandra server; local ephemeral disk space, 2TB of SSD or 1.6TB disk, holding the commit log and SSTables.)
  38. Priam – Cassandra Automation. Available at
     – Netflix Platform Tomcat code
     – Zero touch auto-configuration
     – State management for the Cassandra JVM
     – Token allocation and assignment
     – Broken node auto-replacement
     – Full and incremental backup to S3
     – Restore sequencing from S3
     – Grow/shrink the Cassandra "ring"
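Priam's "token allocation and assignment" can be pictured as spacing node tokens evenly around the ring so each node owns an equal range. A simplified sketch for the RandomPartitioner's 0..2^127 token space (illustrative only; the real Priam also interleaves availability zones and handles node replacement):

```java
import java.math.BigInteger;

public class TokenAllocator {
    // RandomPartitioner token space: 0 .. 2^127.
    private static final BigInteger RING = BigInteger.valueOf(2).pow(127);

    // Node i of n gets token i * 2^127 / n, giving each node
    // an equal slice of the ring.
    public static BigInteger tokenFor(int nodeIndex, int ringSize) {
        return RING.multiply(BigInteger.valueOf(nodeIndex))
                   .divide(BigInteger.valueOf(ringSize));
    }

    public static void main(String[] args) {
        // A 6-node ring: tokens at 0, 1/6, 2/6, ... of the ring.
        for (int i = 0; i < 6; i++) {
            System.out.println("node " + i + " -> " + tokenFor(i, 6));
        }
    }
}
```

Growing or shrinking the "ring" then amounts to recomputing these tokens and streaming data between the affected ranges.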
  39. ETL for Cassandra
     – Data is de-normalized over many clusters!
     – Too many to restore from backups for ETL.
     – Solution: read backup files using Hadoop.
     – Aegisthus: high throughput raw SSTable processing; re-normalizes many clusters to a consistent view; Extract, Transform, then Load into Teradata.
  40. Cloud Architecture Patterns. Where do we start?
  41. Datacenter to Cloud Transition Goals
     – Faster: lower latency than the equivalent datacenter web pages and API calls; measured as mean and 99th percentile; for both first hit (e.g. home page) and in-session hits for the same user.
     – Scalable: avoid needing any more datacenter capacity as subscriber count increases; no central vertically scaled databases; leverage AWS elastic capacity effectively.
     – Available: substantially higher robustness and availability than datacenter services; leverage multiple AWS availability zones; no scheduled down time, no central database schema to change.
     – Productive: optimize agility of a large development team with automation and tools; leave behind the complex tangled datacenter code base (~8 year old architecture); enforce clean layered interfaces and re-usable components.
  42. Datacenter Anti-Patterns. What do we currently do in the datacenter that prevents us from meeting our goals?
  43. Rewrite from Scratch. Not everything is cloud specific. Pay down technical debt. Robust patterns.
  44. Netflix Datacenter vs. Cloud Architecture
     – Central SQL database -> distributed key/value NoSQL
     – Sticky in-memory session -> shared memcached session
     – Chatty protocols -> latency tolerant protocols
     – Tangled service interfaces -> layered service interfaces
     – Instrumented code -> instrumented service patterns
     – Fat complex objects -> lightweight serializable objects
     – Components as jar files -> components as services
  45. Tangled Service Interfaces
     – Datacenter implementation is exposed: Oracle SQL queries mixed into business logic.
     – Tangled code: deep dependencies, false sharing.
     – Data providers with sideways dependencies: everything depends on everything else.
     This anti-pattern affects productivity and availability.
  46. Untangled Service Interfaces. Two layers:
     – SAL, Service Access Library: basic serialization and error handling; REST or POJOs defined by the data provider.
     – ESL, Extended Service Library: caching, conveniences, can combine several SALs; exposes a faceted type system (described later); interface defined by the data consumer in many cases.
  47. Service Interaction Pattern. Sample swimlane diagram.
  48. NetflixOSS Details
     – Platform entities and services
     – AWS accounts and access management
     – Upcoming and recent NetflixOSS components
     – In-depth on NetflixOSS components
  49. Basic Platform Entities
     – AWS based entities: instances and machine images, elastic IP addresses; security groups, load balancers, autoscale groups; availability zones and geographic regions.
     – NetflixOSS specific entities: applications (registered services); clusters (versioned autoscale groups for an app); properties (dynamic hierarchical configuration).
  50. Core Platform Services
     – AWS based services: S3 storage, up to 5TB files, parallel multipart writes; SQS, Simple Queue Service, the messaging layer.
     – Netflix based services: EVCache, a memcached based ephemeral cache; Cassandra, the distributed persistent data store.
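EVCache is described above as a memcached based ephemeral cache: entries expire and are never the source of truth, since Cassandra holds the durable copy and a cache miss falls through to it. A minimal, hypothetical TTL-map sketch of that idea (not the EVCache API; the current time is passed in explicitly to keep the sketch deterministic):

```java
import java.util.HashMap;
import java.util.Map;

public class TtlCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final long expiresAt;
        Entry(V value, long expiresAt) { this.value = value; this.expiresAt = expiresAt; }
    }

    private final Map<K, Entry<V>> map = new HashMap<>();

    // Store with a time-to-live in milliseconds.
    public void put(K key, V value, long ttlMillis, long now) {
        map.put(key, new Entry<>(value, now + ttlMillis));
    }

    // Return the cached value, or null if missing or expired;
    // a null here is the miss that falls through to Cassandra.
    public V get(K key, long now) {
        Entry<V> e = map.get(key);
        if (e == null || now >= e.expiresAt) {
            map.remove(key);
            return null;
        }
        return e.value;
    }
}
```

Because the cache is ephemeral by design, losing a zone's cache instances (as in the Chaos Gorilla tests) costs latency, not data.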
  51. Security Architecture
     – Instance-level security baked into the base AMI: login via ssh only, allowed through a portal (not between instances); each app type runs as its own userid app{test|prod}.
     – AWS Security, Identity and Access Management: each app has its own security group (firewall ports); fine grain user roles and resource ACLs.
     – Key management: AWS keys dynamically provisioned, easy updates; high grade app specific key management support.
  52. AWS Accounts
  53. Accounts Isolate Concerns
     – paastest, for development and testing: fully functional deployment of all services; developer tagged "stacks" for separation.
     – paasprod, for production: autoscale groups only, isolated instances are terminated; alert routing and backups enabled by default.
     – paasaudit, for sensitive services: to support SOX, PCI, etc.; extra access controls, auditing.
     – paasarchive, for disaster recovery: long term archive of backups; different region, perhaps different vendor.
  54. Reservations and Billing
     – Consolidated billing: combine all accounts into one bill; pooled capacity for bigger volume discounts.
     – Reservations: save up to 71% on your baseline load; priority when you request reserved capacity; unused reservations are shared across accounts.
  55. Cloud Access Gateway
     – Datacenter or office based: a separate VM for each AWS account, two per account for high availability; mounts NFS shared home directories for developers; instances trust the gateway via a security group.
     – Manages how developers log in to the cloud: access control via ldap group membership; audit logs of every login to the cloud; similar to the awsfabrictasks ssh wrapper.
  56. Cloud Access Control. (Diagram: developers connect through the cloud access ssh gateway to www-prod (userid wwwprod), dal-prod (userid dalprod), and cass-prod (userid cassprod); security groups don't allow ssh between instances.)
  57. AWS Usage (coming soon). For test, carefully omitting any $ numbers…
  58. Dashboards with Pytheas (Explorers)
     – Cassandra Explorer: browse clusters, keyspaces, column families.
     – Base Server Explorer: browse service endpoint configuration and performance.
     – Anything else you want to build…
  59. Cassandra Explorer
  60. Cassandra Explorer
  61. Cassandra Clusters
  62. Bubble Chart
  63. Slideshare NetflixOSS Details
     – Lightning Talks Feb S1E1
     – Asgard In Depth Feb S1E1
     – Lightning Talks March S1E2
     – Security Architecture
     – Cost Aware Cloud Architectures, with Jinesh Varia of AWS
  64. Amazon Cloud Terminology Reference. This is not a full list of Amazon Web Service features.
     – AWS: Amazon Web Services (common name for the Amazon cloud)
     – AMI: Amazon Machine Image (archived boot disk, Linux, Windows etc. plus application code)
     – EC2: Elastic Compute Cloud. Range of virtual machine types m1, m2, c1, cc, cg with varying memory, CPU and disk configurations. Instance: a running computer system; ephemeral, when it is de-allocated nothing is kept. Reserved Instances: pre-paid to reduce cost for long term usage. Availability Zone: datacenter with its own power and cooling hosting cloud instances. Region: group of availability zones (US-East, US-West, EU-Eire, Asia-Singapore, Asia-Japan, SA-Brazil, US-Gov).
     – ASG: Auto Scaling Group (instances booting from the same AMI)
     – S3: Simple Storage Service (http access)
     – EBS: Elastic Block Storage (network disk filesystem that can be mounted on an instance)
     – RDS: Relational Database Service (managed MySQL master and slaves)
     – DynamoDB/SDB: hosted http based NoSQL datastore (DynamoDB replaces SDB, Simple Data Base)
     – SQS: Simple Queue Service (http based message queue)
     – SNS: Simple Notification Service (http and email based topics and messages)
     – EMR: Elastic Map Reduce (automatically managed Hadoop cluster)
     – ELB: Elastic Load Balancer
     – EIP: Elastic IP (stable IP address mapping assigned to an instance or ELB)
     – VPC: Virtual Private Cloud (single tenant, more flexible network and security constructs)
     – DirectConnect: secure pipe from an AWS VPC to an external datacenter
     – IAM: Identity and Access Management (fine grain role based security keys)