Netflix Cloud Architecture and Open Source


A presentation on the Netflix cloud architecture and NetflixOSS open source projects, given at the All Things Open 2015 conference in Raleigh on 2015/10/19. #ATO2015 #NetflixOSS

Published in: Technology

  1. Netflix Architecture and Open Source
     Andrew Spyker, Senior Software Engineer, Netflix
  2. About Netflix
     • 69M members
     • 2000+ employees (1400 tech)
     • 80+ countries
     • > 100M hours watched per day
     • > ⅓ of NA internet download traffic
     • 500+ microservices
     • Many tens of thousands of VMs
     • 3 regions across the world
  3. About the Speaker
     • Cloud platform technologies: distributed configuration, service discovery, RPC, application frameworks, non-Java sidecar
     • Container cloud: resource management and scheduling, making Docker containers operational in Amazon EC2/ECS
     • Open source: organize @NetflixOSS meetups and internal group
     • Performance: assist across Netflix, but focused mainly on cloud platform perf
     • With Netflix for ~1 year. Previously at IBM here in Raleigh/Durham (RTP)
     • @aspyker
  4. Agenda
     • NetflixOSS
     • Netflix Cloud Architecture
     • Getting started
  5. Why does Netflix open source?
     • Allows engineers to gather feedback: openly talk, through code, about our approach
     • Collaboration on key projects with the world
     • Happily use proven outside open source, and improve it for Netflix scale and availability
     • Netflix culture of freedom and responsibility: Want to open source? Go for it, be responsible!
     • Recruiting and retention: candidates know exactly what they can work on; NetflixOSS engineers choose to stay at Netflix
  6. NetflixOSS is widely used
     • The architecture has shaped public cloud usage: immutability, red/black deploys, chaos, regional and worldwide high availability
     • Offerings: Pivotal Spring Cloud
     • Large usage: IBM Watson as a Service (on IBM Cloud); Nike Digital is hiring NetflixOSS experts
     • Interesting usage: “To help locate new troves of data claiming to be the files stolen from AshleyMadison, the company’s forensics team has been using a tool that Netflix released last year called Scumblr”
  7. NetflixOSS Website Relaunch
  8. Key aspects of NetflixOSS website
     • Show how the pieces fit together: projects are now discussed with each other in context
     • OSS categories mirror internal teams: no artificial categories, focal points for each area
     • Focus on projects that are core to Netflix: projects mentioned are core and strategic
  9. Agenda
     • NetflixOSS
     • Netflix Cloud Architecture
     • Getting Started
  10. Elastic, Web and Hyper Scale
      • Doing this
      • Not doing that
  11. Elastic, Web and Hyper Scale
      Diagram labels: Load Balancers, Front end API, Recommendation Microservice, Another Microservice, Temporal caching, Durable Storage
      Strategy | Benefit
      Make deployments automated | Without automation impossible
      Expose well designed API to users | Offloads presentation complexity to clients
      Remove state for mid tier services | Allows easy elastic scale out
      Push temporal state to client and caching tier | Leverage clients, avoids data tier overload
      Use partitioned data storage | Data design and storage scales with HA
  12. HA and Automatic Recovery
      • Feeling This
      • Not Feeling That
  13. Micro service Implementation
      Diagram labels: Highly Available Service Runtime Recipe, Microservice #1 (REST services), App Service, Call microservice #2, Ribbon REST client with Eureka, Execute call, Hystrix, Fallback Implementation, Microservice #2, Karyon, Eureka Server(s)
      Implementation Detail | Benefits
      Decompose into micro services | • Key user path always available • Failure does not propagate across service boundaries
      Karyon with automatic Eureka registration | • New instances are quickly found • Failing individual instances disappear
      Ribbon client with Eureka awareness | • Load balances & retries across instances with “smarts” • Handles temporal instance failure
      Hystrix as dependency circuit breaker | • Allows for fast failure • Provides graceful cross service degradation/recovery
  14. IaaS High Availability
      Diagram labels: Region (us-east-1); zones us-east-1c, us-east-1d, us-east-1e; ELBs; Eureka; Web App; Service1; Service2; Cluster Auto Recovery and Scaling Services (Auto Scaling Groups)
      Rule | Why?
      Always > 2 of everything | 1 is a SPOF; 2 doesn’t web scale and makes DR recovery slow
      Including IaaS and cloud services | You’re only as strong as your weakest dependency
      Use auto scaler/recovery monitoring | Clusters guarantee availability and service latency
      Use application level health checks | Instance on the network != healthy
      Worldwide availability | Data replication, global front-end routing, cross region traffic
  15. A truly global service
      • Replicate data across regions
      • Be able to redirect traffic from region to region
      • Be able to migrate regional traffic to other regions
      • Have automated control across regions
      • Flux Demo
  16. Testing is the only way to prove HA
      • Chaos Monkey: kills instances in production, runs regularly
      • Chaos Gorilla: kills availability zones (single datacenter); testing for split brain is also important
      • Chaos Kong: kills an entire region and shifts traffic globally; run frequently but with prior scheduling
  17. Continuous Delivery
      • Reading This
      • Not This
  18. Continuous Delivery
      Diagram labels: Continuous Build Server, Baked to images (AMIs), Cluster v1, Canary v2, Cluster v2
      Step | Technology
      Developers test locally | Unit test frameworks
      Continuous build | Continuous build server based on Gradle builds
      Build “bakes” full instance image | Aminator and deployment pipeline bake images from build artifacts
      Developers work across dev and test | Archaius allows for environment based context
      Developers do canary tests, red/black deployments in prod | Asgard console provides app cluster common devops approach, security patterns, and visibility
  19. From Asgard to Spinnaker
      • Spinnaker is our CI/CD solution
      • CI/CD solution including baking and Jenkins integration
      • Workflow engine for continuous delivery
      • Pipeline based deployment including baking
      • Global visibility across all of our AWS regions
      • Provides an API first design
      • A microservices runtime HA architecture
      • More flexible cloud model so the community can contribute back improvements not related to AWS
      • Asgard continues to work side-by-side
      • Spinnaker is this new end to end CI/CD tool
  20. Spinnaker Examples
      • Works at Netflix scale
      • Views of global pipelines
      • From simple Asgard-like deployment to advanced CI/CD pipelines
  21. Operational Visibility
      If you can’t see it, you can’t improve it
  22. Operational Visibility
      Diagram labels: Microservice #1, Microservice #2, Servo/Spectator, Hystrix/Turbine, External Uptime Monitoring, Metric/Event Repositories, LogStash/Elastic Search/Kibana, Incidents, Atlas, Vector
      Visibility Point | Technology
      Basic IaaS instance monitoring | Not enough (not scalable, not app specific)
      User like external monitoring | SaaS offerings or OSS like Uptime
      Targeted performance, sampling | Vector performance and app level metrics
      Service to service interconnects | Hystrix streams ➔ Turbine aggregation ➔ Hystrix dashboard
      Application centric metrics | Servo/Spectator gauges, counters, timers sent to metrics store like Atlas
      Remote logging | Logstash/Kibana or similar log aggregation and analysis frameworks
      Threshold monitoring and alerts | Services like Atlas and PagerDuty for incident management
  23. Security
      • Dynamic Security
      • Done in new ways
      • NOT
  24. Dynamic, Web Scale & Simpler Security
      • Security Monkey: monitors security policies, tracks changes, alerts on situations
      • Scumblr: searches the internet for security “nuggets” (credentials, hacking discussions)
      • Sketchy: a safe way to collect text and screenshots from websites
      • FIDO: automated event detection, analysis, enrichment, and enforcement
      • Sleepy Puppy: delayed cross site scripting propagation testing framework
      • Lemur: x.509 certificate orchestration framework
  25. What did we not cover?
      • Over 50 GitHub projects: NetflixOSS is “Technical indigestion as a service”
      • Big Data, Data Persistence and UI Engineering
      • Big Data tools used well beyond Netflix
      • Ephemeral, semi and fully persistent data systems
      • Recent addition of UI OSS and Falcor
  26. Agenda
      • NetflixOSS
      • Netflix Cloud Architecture
      • Getting Started
  27. How do I get started?
      • All of the previous slides show NetflixOSS components
      • Code:
      • Announcements:
      • Want to get running a bit faster?
      • ZeroToCloud: workshop for getting started with build/bake/deploy in Amazon EC2
      • ZeroToDocker: Docker images containing running Netflix technologies (not production ready, but easy to understand)
  28. ZeroToDocker Demo
      • Host stack: Mac OS X, Virtual Box, Ubuntu 14.04 (single kernel)
      • Containers: Container#1 (filesystem + process), Eureka Container, Zuul Container, Another Container ...
      • Docker running instances: single kernel, contained processes
      • Zookeeper and Exhibitor
      • A microservices app and surrounding NetflixOSS services (Zuul to Karyon with Eureka)
  29. Questions?
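
Slide 13 lists Hystrix as a dependency circuit breaker that allows fast failure and graceful degradation through a fallback. The pattern itself can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Hystrix's API: the class name `CircuitBreaker`, the parameters `failure_threshold` and `reset_timeout`, and the injectable `clock` are all hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Toy circuit breaker; names and defaults are illustrative, not Hystrix's API."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        # Once the timeout has elapsed, report closed so one trial call goes through.
        return self.clock() - self.opened_at < self.reset_timeout

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open():
            return fallback()  # fail fast without touching the dependency
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get the fallback immediately, which is what keeps a failing dependency from dragging down the key user path.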
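
Slide 13 also pairs service registration (Karyon with Eureka) with a registry-aware client (Ribbon) that load balances and retries across instances. A minimal sketch of that idea, assuming an in-memory registry and plain round-robin selection, might look like the following. `Registry`, `RoundRobinClient`, and `max_attempts` are hypothetical names for illustration and do not mirror the Eureka or Ribbon APIs.

```python
from itertools import count
from typing import Callable, Dict, List


class Registry:
    """In-memory stand-in for a service registry (hypothetical, not Eureka's API)."""

    def __init__(self) -> None:
        self._instances: Dict[str, List[str]] = {}

    def register(self, service: str, address: str) -> None:
        self._instances.setdefault(service, []).append(address)

    def lookup(self, service: str) -> List[str]:
        # Return a copy so callers cannot mutate registry state.
        return list(self._instances.get(service, []))


class RoundRobinClient:
    """Picks instances in round-robin order and retries on connection errors."""

    def __init__(self, registry: Registry, max_attempts: int = 3) -> None:
        self.registry = registry
        self.max_attempts = max_attempts
        self._counter = count()

    def call(self, service: str, request: Callable[[str], str]) -> str:
        last_error: Exception = LookupError(f"no instances for {service}")
        for _ in range(self.max_attempts):
            instances = self.registry.lookup(service)
            if not instances:
                break
            address = instances[next(self._counter) % len(instances)]
            try:
                return request(address)
            except ConnectionError as exc:
                last_error = exc  # move on to the next instance
        raise last_error
```

Because the client looks up instances on every attempt, a newly registered instance is picked up without restarting the caller, and a single failing instance only costs one retry.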
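
Slide 18 mentions red/black deployments in production: a new cluster is brought up next to the old one, traffic moves only once the new cluster is healthy, and the old cluster stays available for rollback. The decision step can be sketched as below. This is an assumption-laden illustration, not how Asgard or Spinnaker implement it; `Cluster`, `red_black_deploy`, and the `is_healthy` callback are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Cluster:
    version: str
    instances: List[str] = field(default_factory=list)
    enabled: bool = False  # whether the load balancer sends this cluster traffic


def red_black_deploy(old: Cluster, new: Cluster,
                     is_healthy: Callable[[str], bool]) -> Cluster:
    """Return whichever cluster serves traffic after the deployment attempt."""
    if new.instances and all(is_healthy(i) for i in new.instances):
        new.enabled = True
        old.enabled = False  # old cluster stays provisioned for fast rollback
        return new
    # Any unhealthy (or missing) new instance leaves traffic on the old cluster.
    new.enabled = False
    old.enabled = True
    return old
```

The health callback is where the application level health checks from slide 14 would plug in, since an instance being on the network does not mean it is healthy.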