SV Forum Platform Architecture SIG - Netflix Open Source Platform


Overview of the Netflix cloud architecture, with a focus on the open source components that Netflix has put, and is planning to release, on

Published in: Technology

  1. The Netflix Open Source Platform
      September 26th, 2012
      Adrian Cockcroft, Ruslan Meshenberg
      @adrianco @rusmeshenberg #netflixcloud
      http:// http://
  2. What Netflix Did
      • Moved to SaaS
        – Corporate IT – OneLogin, Workday, Box, Evernote…
        – Tools – Pagerduty, AppDynamics, Elastic MapReduce
      • Built our own PaaS
        – Customized to make our developers productive
        – When we started, we had little choice
      • Moved incremental capacity to IaaS
        – No new datacenter space since 2008 as we grew
        – Moved our streaming apps to the cloud
  3. Why Use Cloud?
  4. Things we don’t do
  5. Netflix Choice was AWS with our own platform and tools
      Unique platform requirements and extreme scale, agility and flexibility
  6. Leverage AWS Scale
      “the biggest public cloud”
      AWS investment in features and automation
      Use AWS zones and regions for high availability, scalability and global deployment
  7. What about other PaaS?
      • CloudFoundry – Open Source by VMWare
        – Developer-friendly, easy to get started
        – Missing scale and some enterprise features
      • Rightscale
        – Widely used to abstract away from AWS
        – Creates its own lock-in problem…
      • AWS is growing into this space
        – We didn’t want a vendor between us and AWS
        – We wanted to build a thin PaaS, that gets thinner
  8. What do developers care about?
  9. Keeping up with Developer Trends
      Trend (year in production at Netflix)
      • Big Data/Hadoop (2009)
      • AWS Cloud (2009)
      • Application Performance Management (2010)
      • Integrated DevOps Practices (2010)
      • Continuous Integration/Delivery (2010)
      • NoSQL (2010)
      • Platform as a Service; Fine grain SOA (2010)
      • Social coding, open development/github (2011)
  10. AWS specific feature dependence…
  11. Portability vs. Functionality
      • Portability – the Operations focus
        – Avoid vendor lock-in
        – Support datacenter based use cases
        – Possible operations cost savings
      • Functionality – the Developer focus
        – Less complex test and debug, one mature supplier
        – Faster time to market for your products
        – Possible developer cost savings
  12. Portable PaaS
      • Portable IaaS Base - some AWS compatibility
        – Eucalyptus – AWS licensed compatible subset
        – CloudStack – Citrix Apache project
        – OpenStack – Rackspace, Cloudscaling, HP etc.
      • Portable PaaS
        – VMWare Cloud Foundry - run it yourself in your DC
        – AppFog and Stackato – Cloud Foundry/Openstack
        – Vendor options: Rightscale, Enstratus, Smartscale
  13. Functional PaaS
      • IaaS base - all the features of AWS
        – Very large scale, mature, global, evolving rapidly
        – ELB, Autoscale, VPC, SQS, EIP, EMR, DynamoDB etc.
        – Large files (TB) and multipart writes in S3
      • Functional PaaS – Netflix added features
        – Very large scale, mature, flexible, customizable
        – Asgard console, Monkeys, Big data tools
        – Cassandra/Zookeeper data store automation
  14. Developers choose Functional
      Don’t let the roadie write the set list! (yes you do need all those guitars on tour…)
  15. Freedom and Responsibility
      • Developers leverage cloud to get freedom
        – Agility of a single organization, no silos
      • But now developers are responsible
        – For compliance, performance, availability etc.
      “As far as my rehab is concerned, it is within my ability to change and change for the better” - Eddie Van Halen
  16. Amazon Cloud Terminology Reference
      See
      This is not a full list of Amazon Web Service features
      • AWS – Amazon Web Services (common name for Amazon cloud)
      • AMI – Amazon Machine Image (archived boot disk, Linux, Windows etc. plus application code)
      • EC2 – Elastic Compute Cloud
        – Range of virtual machine types m1, m2, c1, cc, cg. Varying memory, CPU and disk configurations.
        – Instance – a running computer system. Ephemeral, when it is de-allocated nothing is kept.
        – Reserved Instances – pre-paid to reduce cost for long term usage
        – Availability Zone – datacenter with own power and cooling hosting cloud instances
        – Region – group of Avail Zones – US-East, US-West, EU-Eire, Asia-Singapore, Asia-Japan, SA-Brazil, US-Gov
      • ASG – Auto Scaling Group (instances booting from the same AMI)
      • S3 – Simple Storage Service (http access)
      • EBS – Elastic Block Store (network disk filesystem can be mounted on an instance)
      • RDS – Relational Database Service (managed MySQL master and slaves)
      • DynamoDB/SDB – Simple Data Base (hosted http based NoSQL datastore, DynamoDB replaces SDB)
      • SQS – Simple Queue Service (http based message queue)
      • SNS – Simple Notification Service (http and email based topics and messages)
      • EMR – Elastic Map Reduce (automatically managed Hadoop cluster)
      • ELB – Elastic Load Balancer
      • EIP – Elastic IP (stable IP address mapping assigned to instance or ELB)
      • VPC – Virtual Private Cloud (single tenant, more flexible network and security constructs)
      • DirectConnect – secure pipe from AWS VPC to external datacenter
      • IAM – Identity and Access Management (fine grain role based security keys)
  17. What Runs in the Cloud?
      Step by Step Netflix Product Transition
  18. Non-Member Web Site
  19. Member Web Site
  20. Content Delivery Service
  21. Netflix APIs
  22. Streaming Device API
      Netflix Ready Devices
      From: May 2008 To: May 2010
  23. Current Architectural Patterns for Availability
      • Isolated Services
        – Resilient Business logic
      • Three Balanced Availability Zones
        – Resilient to Infrastructure outage
      • Triple Replicated Persistence
        – Durable distributed Storage
      • Isolated Regions
        – US and EU don’t take each other down
  24. Isolated Services
      Test With Chaos Monkey, Latency Monkey
  25. Three Balanced Availability Zones
      Test with Chaos Gorilla
      [Diagram: Load Balancers in front of Zone A, Zone B and Zone C; each zone holds Cassandra and Evcache Replicas]
  26. Triple Replicated Persistence
      Cassandra maintenance drops individual replicas
      [Diagram: Load Balancers in front of Zone A, Zone B and Zone C; each zone holds Cassandra and Evcache Replicas]
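
The zone layout on slides 25 and 26 can be reasoned about with simple quorum arithmetic. The sketch below is an illustration only, not Netflix code: it assumes one replica per zone and a majority quorum for reads and writes (the slides state only that data is triple replicated across three zones), and the function names are hypothetical.

```python
def quorum_size(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def can_serve_quorum(replication_factor: int, replicas_down: int) -> bool:
    """True if enough replicas remain to answer a majority-quorum request."""
    live = replication_factor - replicas_down
    return live >= quorum_size(replication_factor)


if __name__ == "__main__":
    # Assumed layout: replication factor 3, one replica in each of three zones.
    for down in range(4):
        status = "available" if can_serve_quorum(3, down) else "unavailable"
        print(f"{down} replica(s) down: quorum {status}")
```

Under these assumptions, losing a single zone (or taking one replica down for maintenance) still leaves two of three replicas, which is enough for a majority.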
  27. Isolated Regions
      [Diagram: US-East Load Balancers in front of Zone A, Zone B and Zone C, and EU-West Load Balancers in front of Zone A, Zone B and Zone C; each zone holds Cassandra Replicas]
  28. Failure Modes and Effects
      Failure Mode          Probability   Mitigation Plan
      Application Failure   High          Automatic degraded response
      AWS Region Failure    Low           Wait for region to recover
      AWS Zone Failure      Medium        Continue to run on 2 out of 3 zones
      Datacenter Failure    Medium        Migrate more functions to cloud
      Data store failure    Low           Restore from S3 backups
      S3 failure            Low           Restore from remote archive
  29. Observed Regional Failures
      • Power Outages
        – Platform survives any one zone outage
        – Two recent zone outages, one OK, one triggered a bug
      • Router Bug Takes Region Offline
        – A few minutes of no network traffic, then recovered
        – AWS has redesigned routes to be per zone
      • Control Plane Overload Affects Entire Region
        – Consequence of other outages
        – We lose control of our infrastructure
  30. Netflix Deployed on AWS
      • Content (2009): Content Management; EC2 Encoding; S3 Petabytes; CDNs, ISPs, Terabits, Customers
      • Logs (2009): S3 Terabytes; EMR; Hive & Pig; Business Intelligence
      • Play (2010): DRM; CDN routing; Bookmarks; Logging
      • WWW (2010): Sign-Up; Search; Movie Choosing; Ratings
      • API (2010): Metadata; Device Config; TV Movie Choosing; Social Facebook
      • CS (2011): International CS lookup; Diagnostics & Actions; Customer Call Log; CS Analytics
  31. Cloud Architecture Patterns
      Where do we start?
  32. Datacenter to Cloud Transition Goals
      • Faster
        – Lower latency than the equivalent datacenter web pages and API calls
        – Measured as mean and 99th percentile
        – For both first hit (e.g. home page) and in-session hits for the same user
      • Scalable
        – Avoid needing any more datacenter capacity as subscriber count increases
        – No central vertically scaled databases
        – Leverage AWS elastic capacity effectively
      • Available
        – Substantially higher robustness and availability than datacenter services
        – Leverage multiple AWS availability zones
        – No scheduled down time, no central database schema to change
      • Productive
        – Optimize agility of a large development team with automation and tools
        – Leave behind complex tangled datacenter code base (~8 year old architecture)
        – Enforce clean layered interfaces and re-usable components
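
The "Faster" goal above is stated in terms of mean and 99th percentile latency. As a rough sketch of how those two numbers differ, the example below computes both from a list of samples. The nearest-rank percentile method, the function names and the sample values are all assumptions made for illustration; the slides do not say how Netflix computed these figures.

```python
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, which avoids float rounding surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def summarize(latencies_ms):
    """Return the two measures named on the slide: mean and 99th percentile."""
    return {
        "mean_ms": statistics.fmean(latencies_ms),
        "p99_ms": percentile(latencies_ms, 99),
    }


if __name__ == "__main__":
    # Made-up samples: mostly fast requests plus a couple of slow outliers.
    samples = [10] * 98 + [500] * 2
    print(summarize(samples))
```

The mean hides the slow tail that a small share of users experience, which is why a high percentile is tracked alongside it.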
  33. Netflix Datacenter vs. Cloud Arch
      Datacenter                    Cloud
      Central SQL Database          Distributed Key/Value NoSQL
      Sticky In-Memory Session      Shared Memcached Session
      Chatty Protocols              Latency Tolerant Protocols
      Tangled Service Interfaces    Layered Service Interfaces
      Instrumented Code             Instrumented Service Patterns
      Fat Complex Objects           Lightweight Serializable Objects
      Components as Jar Files       Components as Services
  34. Availability and Resilience
  35. Chaos Monkey
      • Computers (Datacenter or AWS) randomly die
        – Fact of life, but too infrequent to test resiliency
      • Test to make sure systems are resilient
        – Allow any instance to fail without customer impact
      • Chaos Monkey hours
        – Monday-Friday 9am-3pm random instance kill
      • Application configuration option
        – Apps now have to opt-out from Chaos Monkey
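
The rules on the Chaos Monkey slide (weekday hours, random instance, opt-out per application) can be sketched in a few lines. This is not the released Chaos Monkey code: the `Instance` type, the function names and the choice to treat 3pm as exclusive are assumptions for illustration, and the actual termination call is deliberately left out.

```python
import random
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Instance:
    instance_id: str  # hypothetical identifier
    app: str          # application the instance belongs to


def in_chaos_hours(now: datetime) -> bool:
    """Monday-Friday, from 9am up to (not including) 3pm."""
    return now.weekday() < 5 and 9 <= now.hour < 15


def pick_victim(instances, opted_out_apps, now, rng):
    """Return one random instance to kill, or None if nothing is eligible."""
    if not in_chaos_hours(now):
        return None
    # Apps are in scope by default and must opt out explicitly.
    candidates = [i for i in instances if i.app not in opted_out_apps]
    return rng.choice(candidates) if candidates else None


if __name__ == "__main__":
    fleet = [Instance("i-1", "api"), Instance("i-2", "billing")]
    victim = pick_victim(fleet, {"billing"}, datetime(2012, 9, 26, 10, 0), random.Random())
    print("would terminate:", victim)
```

Restricting kills to working hours means engineers are at their desks when a failure exposes a weakness.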
  36. Responsibility and Experience
      • Make developers responsible for failures
        – Then they learn and write code that doesn’t fail
      • Use Incident Reviews to find gaps to fix
        – Make sure it’s not about finding “who to blame”
      • Keep timeouts short, fail fast
        – Don’t let cascading timeouts stack up
      • Make configuration options dynamic
        – You don’t want to push code to tweak an option
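
The last two bullets above (short timeouts, dynamic configuration) fit together: a timeout read from a runtime-adjustable store can be tightened without a code push. The sketch below assumes a simple in-process store; the `DynamicConfig` class and the property name are hypothetical and are not the API of Archaius or any other Netflix library.

```python
import threading


class DynamicConfig:
    """In-memory options that can be changed while the process is running."""

    def __init__(self, defaults):
        self._lock = threading.Lock()
        self._values = dict(defaults)

    def get(self, key):
        with self._lock:
            return self._values[key]

    def update(self, overrides):
        """Apply new values, e.g. pushed by an operator or a polling thread."""
        with self._lock:
            self._values.update(overrides)


def call_timeout_seconds(config):
    """Look the timeout up on every call so that changes take effect at once."""
    return config.get("recommendations.timeout_ms") / 1000.0


if __name__ == "__main__":
    config = DynamicConfig({"recommendations.timeout_ms": 200})
    print(call_timeout_seconds(config))
    config.update({"recommendations.timeout_ms": 50})  # no redeploy needed
    print(call_timeout_seconds(config))
```

Reading the value at call time, rather than caching it at startup, is the design choice that makes the option dynamic.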
  37. Resilient Design – Circuit Breakers
      http://-tolerance-in-high-volume.html
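
For readers unfamiliar with the circuit breaker pattern named on slide 37, here is a minimal single-threaded sketch. It is a generic illustration, not Netflix's implementation: the class name, the consecutive-failure threshold, the cool-down period and the injectable clock are all assumptions, and a production version would also need thread safety and metrics.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected without contacting the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=5.0, clock=time.monotonic):
        self._failure_threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after_s:
                raise CircuitOpenError("circuit open, failing fast")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, opens the circuit.
            if self._opened_at is not None or self._failures >= self._failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers get an immediate error that they can turn into a degraded response, instead of waiting on a dependency that is already failing.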
  38. Distributed Operational Model
      • Developers
        – Provision and run their own code in production
        – Take turns to be on call if it breaks (pagerduty)
        – Configure autoscalers to handle capacity needs
      • DevOps and PaaS (aka NoOps)
        – DevOps is used to build and run the PaaS
        – PaaS constrains Dev to use automation instead
        – PaaS puts more responsibility on Dev, with tools
  39. What’s Left for Corp IT?
      • Corporate Security and Network Management
        – Billing and remnants of streaming service back-ends in DC
      • Running Netflix’ DVD Business
        – Tens of Oracle instances
        – Hundreds of MySQL instances
        – Thousands of VMWare VMs
        – Zabbix, Cacti, Sumologic, Puppet, Chef
      • Employee Productivity
        – Building networks and WiFi
        – SaaS OneLogin SSO Portal
        – Evernote Premium, Safari Online Bookshelf, Dropbox for Teams
        – Google Enterprise Apps, Workday HCM/Expense,
        – Many more SaaS migrations coming…
      [Chart: Corp WiFi Performance]
  40. Netflix Organization
      DevOps Org Reporting into Product Group, not ITops
      Netflix Cloud Platform Team
      [Org chart] Teams: Cloud Ops Reliability Engineering; Architecture; Build Tools and Automation; Platform and Persistence Engineering; Cloud Performance; Cloud Solutions
      [Org chart] Items shown under the teams: Perforce, Jenkins, Artifactory, JIRA; Platform jars; Cassandra; Future planning; Benchmarking; Monitoring; Alert Routing; Key store; Security Arch; Monkeys; Incident Lifecycle; Base AMI, Bakery; Zookeeper; JVM GC Tuning; Efficiency; Netflix App Console; Wiresharking; Entrypoints; Cassandra; AWS VPC; PagerDuty; Hyperguard; AWS API; AWS Instances; Powerpoint :)
  41. Netflix Open Source Strategy
      • Steadily release PaaS Components git-by-git
      • Source at – we build from it…
      • Intros and techniques at
  42. Give back to Apache licensed OSS community
  43. Lead the Best Practices
  44. Motivate, regain, hire top engineers
  45. “Peer Pressure” code cleanup
  46. External contributions
  47. Clean Code is Re-usable
      • Use by other teams and projects inside Netflix
  48. Timeline
  49. http://
  50. Simian Army (Chaos Monkey)
      http://-monkey-released-into-wild.html
  51. Asgard
      http://-web-based-cloud-management-and.html
  52. Astyanax, Priam, Curator, Exhibitor
  53. Active Pipeline
  54. Instance creation
      [Diagram] Components: Bakery & Build tools; Asgard; Base AMI; Instance; Autoscaling scripts; Application Code; Odin
      [Diagram] Stages: Image baked; ASG / Instance started; Instance Running
  55. Runtime
      [Diagram] Components: Governator; Eureka; Async logging; Archaius; Entrypoints; Servo
      [Diagram] Stages: Application initializing; Registering, configuration
  56. Runtime, Cont’d
      [Diagram] Components: Astyanax; Priam; Curator; Chaos Monkey; Latency Monkey; NIWS LB; Exhibitor; Janitor Monkey; Cass JMeter; Dependency Command; REST client; Explorers
      [Diagram] Stages: Calling other services; Managing service; Resiliency aids
  57. Open Source Projects
      Legend: Github / Techblog; Apache Contributions; Techblog Post; Coming Soon
      • Priam – Cassandra as a Service
      • Exhibitor – Zookeeper as a Service
      • Servo and Autoscaling Scripts
      • Astyanax – Cassandra client for Java
      • Curator – Zookeeper Patterns
      • Honu – Log4j streaming to Hadoop
      • CassJMeter – Cassandra test suite
      • EVCache – Memcached as a Service
      • Circuit Breaker – Robust service pattern
      • Cassandra – Multi-region EC2 datastore support
      • Eureka / Discovery – Service Directory
      • Asgard – AutoScaleGroup based AWS console
      • Aegisthus – Hadoop ETL for Cassandra
      • Archaius – Dynamics Properties Service
      • Chaos Monkey – Robustness verification
      • Latency Monkey – Server-side latency/error injection
      • Governator – Library lifecycle and dependency injection
      • Odin – Workflow orchestration
      • Explorers
      • EntryPoints
      • Janitor Monkey
      • REST Client + mid-tier LB
      • Bakeries and AMI
      • Async logging
      • Configuration REST endpoints
      • Build dynaslaves
  58. Repeat after me…
  59. Roadmap for 2012
      • More resiliency and improved availability
      • More automation, orchestration
      • “Hardening” the platform, code clean-up
      • Lower latency for web services and devices
      • IPv6 – now running in prod, rollout in process
      • More open sourced components
      • See you at AWS Re:Invent in November…
  60. Takeaway
      Netflix has built and deployed a scalable global Platform as a Service.
      Key components of the Netflix PaaS are being released as Open Source projects so you can build your own custom PaaS.
      http:// http:// http:// http:// http://
      @adrianco @rusmeshenberg #netflixcloud