Netflix Cloud Platform Building Blocks


Architectural Building Blocks of the Netflix Cloud Platform and lessons learned while implementing the same.

Commandments of Web Scale Cloud Deployments

Published in: Technology, Business

  1. Netflix Cloud Platform Building Blocks
     Architectural Building Blocks on Amazon Public Cloud
     Sudhir Tonse (@stonse) #gitpronflx
  2. Introduction
  3. What is Netflix?
     With more than 26 million streaming members in the United States, Canada, Latin America, the United Kingdom and Ireland, Netflix, Inc. (NASDAQ: NFLX) is the world's leading internet subscription service for enjoying movies and TV programs. … In all, more than 800 devices that stream from Netflix are available. (http:// http://
  4. Who Am I
     • In the Movie Business :)
       – Manager, Cloud Platform/Infrastructure @ Netflix
       – @ Netflix since 2008
       – Prior day jobs
         • System Architect/Lead @ AOL (Netscape, iPlanet, Sun)
     • @stonse
     • http://
     Important: This talk is a developer community outreach by me as an individual, and the content here may or may not reflect Netflix's official view.
  5. Why am I here?
     • Share the Story of Netflix and its use of the Amazon Cloud
       – Why did Netflix move to the Cloud?
       – How did we move?
       – What did we learn?
     • Share Technical Challenges and Solutions
       – Contribute back to the community
     • Perhaps Interest you in Helping us Reach the Next Steps
       – Yes, I am Hiring!
  6. What is in it for You?
     • Various Open Source Offerings
     • Tech papers
     • Blogs & Articles
     • Meetups and Talks like this :)
  7. What's in it for Netflix?
     • Small bird in a Big Cloud
     • Tech Community Engagement
     • Open Source Contributions
  8. Cloud
  9. Cloud
     • What is it?
     • Why Cloud?
  10. What's a Cloud?
     • Cloud: Cloud computing is the delivery of computing and storage capacity [1] as a service [2] to a heterogeneous community of end-recipients.
     Images Courtesy: Wikipedia/Company logos
  11. Cloud Stack
     • Clients: Browsers, Mobile, Televisions …
     • SaaS/Applications: Netflix Apps/Services
     • PaaS (Cloud Platform): Netflix Execution Env (JVM), Web/App Servers, Frameworks, Tools
     • IaaS: Virtual Machines, Networking, Load Balancers …
  12. Netflix Cloud Platform
     • PaaS Building Blocks on top of Amazon's IaaS
  13. Platform Blocks
     • App Infrastructure
     • Messaging
     • Internationalization/L10N/Geo
     • Server
     • Client
     • Security
     • Big Data
     • Tools/Frameworks
     • Design/Architecture
     • Configuration Management
     • Diagnostics
  14. Web Scale
     • Billions of Requests per day
     • Terabytes of data per day
     • Millions of Metric data points per day
     • Hundreds of services
  15. Why Cloud
     http://-netflix-api.html
  16. Why Cloud contd …
     • Undifferentiated Heavy Lifting
       – Multi Region
       – On Demand Computing Power
       – Tons of Features :)
  17. On Demand Auto Scaling!
     • Traffic Patterns (charts of Compute over Time): Slow Growth, Periodic Jobs, Predictable Bursts, Unpredictable Bursts, Steady State
     • Scale UP & Down based on Demand
       – Use CloudWatch
         • RPS
         • Load Average
         • …
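The scale up/down rule on this slide can be sketched as a small decision function. This is a minimal illustration only: `ScalingPolicy`, its parameters and the per-instance RPS target are hypothetical names chosen for the example, and nothing here calls CloudWatch or any AWS API.

```java
/** Hypothetical scaling rule (not an AWS API): size the group from observed RPS. */
final class ScalingPolicy {
    private final double targetRpsPerInstance; // assumed capacity of one instance
    private final int minInstances;
    private final int maxInstances;

    ScalingPolicy(double targetRpsPerInstance, int minInstances, int maxInstances) {
        this.targetRpsPerInstance = targetRpsPerInstance;
        this.minInstances = minInstances;
        this.maxInstances = maxInstances;
    }

    /** Returns the instance count needed for totalRps, clamped to [min, max]. */
    int desiredInstances(double totalRps) {
        int needed = (int) Math.ceil(totalRps / targetRpsPerInstance);
        return Math.max(minInstances, Math.min(maxInstances, needed));
    }
}
```

In a real deployment the metric (RPS, load average) would come from monitoring and the group size would be changed by the Auto Scaling Group itself; the sketch only shows the arithmetic behind "scale based on demand".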
  18. Scale Up
     [Figure: instances are added to a group of running instances]
  19. Scale Down
     [Figure: instances are removed from the group]
  20. Story of Netflix
  21. DataCenter to Cloud Timeline
  22. DC to SOA
     • Old DataCenter (2008)
       – Everything in one WebApp (.war)
     • Netflix Cloud (2012)
       – 100s of Fine Grained Services
  23. Old Lessons
     • One missing semicolon can bring your site down!!!
       – Runaway Thread
     • Lessons
       – Async execution
       – Timed gets() (i.e. use a Java Future)
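The "timed gets()" lesson can be illustrated with the standard `java.util.concurrent.Future` API. This is a sketch, not Netflix's code: `TimedCall` and `callWithTimeout` are hypothetical names, and returning a fallback value on timeout is an assumption made for the example.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: run a task asynchronously and never wait on it forever. */
final class TimedCall {
    /** Returns the task result, or fallback if the task is late, fails or is interrupted. */
    static <T> T callWithTimeout(ExecutorService pool, Callable<T> task,
                                 long timeoutMillis, T fallback) {
        Future<T> future = pool.submit(task); // async execution
        try {
            // Timed get: the caller thread is released after timeoutMillis at most.
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the runaway worker
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the task itself threw
        }
    }
}
```

The point of the design is that a slow or stuck dependency costs the request thread a bounded wait rather than the whole thread.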
  24. Deployment Concepts
     [Diagram: an Application is made up of Clusters 1 … n; each Cluster contains one or more ASGs; each ASG contains Instances]
  25. Sample Deployment Architecture
  26. Showcasing Platform Components
  27. What's in a name? (Shakespeare)
     • Cloud instances are ephemeral
       – They have no fixed NAME
       – They have a public IP address, a private IP address and can optionally be associated with an Elastic IP Address
       – How can you address your services?
         • Via Elastic IP (but these are limited per account)
         • Route 53 (a DNS service offered by Amazon)
         • Netflix uses an in-house app called Discovery Service
           – Keeper of addresses and metadata of running instances
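The idea of a "keeper of addresses and metadata of running instances" can be sketched as an in-memory registry. This is an illustration under stated assumptions, not the Discovery Service's actual API: `InstanceInfo`, `ServiceRegistry` and their fields are hypothetical names, and a real service would also handle heartbeats, expiry and replication.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical metadata kept per running instance. */
final class InstanceInfo {
    final String instanceId;
    final String privateIp;
    final String zone;

    InstanceInfo(String instanceId, String privateIp, String zone) {
        this.instanceId = instanceId;
        this.privateIp = privateIp;
        this.zone = zone;
    }
}

/** Hypothetical registry: service name -> currently registered instances. */
final class ServiceRegistry {
    private final Map<String, List<InstanceInfo>> byService = new ConcurrentHashMap<>();

    /** Called by an instance when it starts. */
    void register(String service, InstanceInfo info) {
        byService.computeIfAbsent(service, k -> new CopyOnWriteArrayList<>()).add(info);
    }

    /** Called when an instance shuts down or is declared dead. */
    void deregister(String service, String instanceId) {
        List<InstanceInfo> instances = byService.get(service);
        if (instances != null) {
            instances.removeIf(i -> i.instanceId.equals(instanceId));
        }
    }

    /** Returns a snapshot of the instances currently registered for a service. */
    List<InstanceInfo> lookup(String service) {
        List<InstanceInfo> instances = byService.get(service);
        return instances == null ? Collections.<InstanceInfo>emptyList()
                                 : new ArrayList<>(instances);
    }
}
```

Because clients look services up by name, instances can come and go without anyone relying on a fixed host name or Elastic IP.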
  28. Inter Process Communication (Sudhir Tonse)
     • Netflix uses NIWS
       – Netflix Internal Web Services
       – Common infrastructural library that aids in RPC
         • Based on JSR-311 (Jersey)
         • Uses Discovery Service to obtain instances of every service
         • Has a built-in mid-tier software LoadBalancer
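A client-side software load balancer can be as simple as rotating over the instances returned by discovery. The sketch below assumes round-robin selection, which the slides do not specify; `RoundRobinBalancer` is a hypothetical name and this is not the NIWS implementation.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical client-side balancer: round-robin over the known servers. */
final class RoundRobinBalancer {
    private final AtomicInteger next = new AtomicInteger();

    /** Picks the next server in rotation from the list the caller obtained from discovery. */
    String choose(List<String> servers) {
        if (servers.isEmpty()) {
            throw new IllegalStateException("no servers available");
        }
        // floorMod keeps the index valid even after the counter overflows.
        int index = Math.floorMod(next.getAndIncrement(), servers.size());
        return servers.get(index);
    }
}
```

Passing the server list on every call lets the balancer follow instances as they are registered and removed.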
  29. BiopSys (Danny Yuan)
     • Search Logs on 1000s of Amazon Instances
       – Per Cluster, Apps, Instances, Time Range etc.
  30. S3 Diagnostics (Sudhir Tonse)
     • Help Debug S3 Latency if any
  31. Metrics
     • One cannot fully Understand what One cannot Observe :)
     • Netflix Platform has several Metrics/Data Collection components
       – Servo (@Monitors)
       – Tracers/Counters
       – Chukwa (for Log Events and Business Metrics)
       – More :)
  32. Metrics (@royrapaport)
  33. Cassandra Dashboard (Eran Landau)
     • Visualize Status of Multiple Cassandra Clusters
  34. Lessons Learned
     • Roman Riding is hard
       – e.g. sharing traffic between Datacenter (SQL) and Cloud (NoSQL)
     • Plan for Failure
       – Test for Failure (Chaos Monkey & Simian Army)
  35. Commandments of Web Scale Cloud Deployment
  36. Cloud Commandments
     1. Thou shalt not have sticky in-memory sessions
        – Hard to scale
     2. Thou shalt not directly use a central SQL database in the user request path
        – At least not one that uses locks and transactions
     3. Thou shalt not store important data on ephemeral instances
        – These are lost when instances go down. Use EBS volumes, S3 or other persistence stores
     4. Thou shalt embrace a homogeneous architecture
        – Much easier to achieve operational efficiency
     5. Thou shalt understand and embrace the CAP theorem
        – Choice between CP and AP. Most web scale deployments choose AP
     6. Thou shalt guard all external calls using the Dependency Command Pattern
        – Idea is to effectively guard user request processing threads
     7. Thou shalt be prepared to scale according to thy needs
        – Web traffic can come in bursts; it's important to scale up/down the whole SOA stack based on resources needed
  37. Cloud Commandments contd…
     8. Thou shalt keep a wary eye on thy cost
        – It all adds up eventually. Plenty of low-hanging fruit available to save costs
     9. Thou shalt secure thy data and instances
        – Encrypt data; secure access to instances. (Pay attention to Security Groups)
     10. Thou shalt instrument thy code
        – You can't trust what you can't see
     11. Thou shalt effectively monitor thy access points
        – It's the cloud and things can go wrong or go reaaaal slooow
     12. Thou shalt deploy thy instances in multiple regions and zones
        – For maximizing SLAs and availability
     13. Thou shalt be wary of SPOF
        – Mantra of distributed system design
     14. Thou shalt always plan for failure
        – It's just a question of when, not if. Have a good backup plan
  38. Concepts
     • Throttling/Metering
       – Thundering Herd (Retry Storms)
     • Graceful Degradation
       – Appropriate Fallbacks
  39. Metering
     • Protocol Level
       – NIWS features
         • Client Side Guards
         • Service Side Metering
     • Client API level
       – Dependency Command Pattern
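Metering at either the client or the service side boils down to admitting a bounded number of requests per unit of time. The slides do not say which algorithm NIWS uses, so the sketch below assumes a simple fixed-window counter; `FixedWindowThrottle` and its parameters are hypothetical names.

```java
/** Hypothetical throttle: admit at most maxPerWindow requests per time window. */
final class FixedWindowThrottle {
    private final int maxPerWindow;
    private final long windowMillis;
    private long windowStart; // start time of the current window, in milliseconds
    private int count;        // requests admitted in the current window

    FixedWindowThrottle(int maxPerWindow, long windowMillis) {
        this.maxPerWindow = maxPerWindow;
        this.windowMillis = windowMillis;
    }

    /** Returns true if a request arriving at nowMillis is admitted, false if it is shed. */
    synchronized boolean tryAcquire(long nowMillis) {
        if (nowMillis - windowStart >= windowMillis) {
            windowStart = nowMillis; // a new window begins
            count = 0;
        }
        if (count < maxPerWindow) {
            count++;
            return true;
        }
        return false; // caller should fail fast or serve a fallback
    }
}
```

The clock is passed in as an argument so the behavior is easy to test; production code would typically pass `System.currentTimeMillis()`. Rejected requests should get a fallback rather than an immediate retry, otherwise the throttle itself feeds the retry storm.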
  40. Dependency Command Pattern (Ben Christensen)
     http://-tolerance
  41. Dependency Command
     Effect of Latency …
  42. Dependency Command
  43. Dependency Command
     • network timeouts and retries
     • separate threads on per-dependency thread pools
     • semaphores (via a tryAcquire, not a blocking call)
     • circuit breakers
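Two of the guards listed above, the non-blocking semaphore and the circuit breaker, can be combined in a few lines. This is a simplified sketch and not Netflix's implementation: `GuardedCommand`, the consecutive-failure threshold and the manual `reset()` are assumptions made for the example. A production breaker would close again on its own after a sleep window, and timeouts and per-dependency thread pools are left out.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical wrapper around one external dependency. */
final class GuardedCommand<T> {
    private final Semaphore permits;         // caps concurrent calls to the dependency
    private final int failureThreshold;      // consecutive failures that open the circuit
    private final AtomicInteger consecutiveFailures = new AtomicInteger();

    GuardedCommand(int maxConcurrent, int failureThreshold) {
        this.permits = new Semaphore(maxConcurrent);
        this.failureThreshold = failureThreshold;
    }

    /** Runs call under the guards; returns the fallback whenever the call is rejected or fails. */
    T execute(Supplier<T> call, Supplier<T> fallback) {
        // Circuit open: stop hitting a dependency that keeps failing.
        if (consecutiveFailures.get() >= failureThreshold) {
            return fallback.get();
        }
        // tryAcquire never blocks, so request threads are shed instead of queued.
        if (!permits.tryAcquire()) {
            return fallback.get();
        }
        try {
            T result = call.get();
            consecutiveFailures.set(0);
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures.incrementAndGet();
            return fallback.get();
        } finally {
            permits.release();
        }
    }

    /** Closes the circuit again (assumed to be triggered by an operator or a timer). */
    void reset() {
        consecutiveFailures.set(0);
    }
}
```

In every rejected path the caller still gets an answer, which is what makes graceful degradation possible.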
  44. Failures
     • Failures will happen
       – It's a question of when and how, NOT "if"
     • Plan
       – Regularly Test for possible Failures
       – Netflix Simian Army: e.g. Chaos Monkey, Latency Monkey …
     • Severity
       – Minimize the impact of a failure
     • Occurrence
       – Minimize the frequency of a failure
     • Observability
       – Minimize the time to detect and respond
  45. Simian Army
     Chaos Monkey
     • Simulates hard failures in AWS by killing a few instances per ASG (i.e. Auto Scaling Group)
     • Similar to how EC2 instances can be killed by AWS with little warning
     • Tests clients' ability to gracefully deal with broken connections, interrupted calls, etc.
     • Verifies that all services are running within the protection of AWS Auto Scaling Groups, which reincarnate killed instances
       – If not, the Chaos Monkey will win!
     Conformity Monkey
     • Verifies that all services are running within the protection of AWS Auto Scaling Groups, which reincarnate killed instances
       – If not, the app/service team is notified
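The "kill a few instances per ASG" step can be pictured as a random selection over each group. The sketch below only chooses victims and terminates nothing; `ChaosSelector` and `pickVictims` are hypothetical names, picking exactly one instance per group is an assumption, and this is not the Chaos Monkey code.

```java
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

/** Hypothetical victim selection: one random instance from each non-empty group. */
final class ChaosSelector {
    /** Maps each ASG name to the instance id chosen for termination; empty groups are skipped. */
    static Map<String, String> pickVictims(Map<String, List<String>> instancesByAsg,
                                           Random random) {
        Map<String, String> victims = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : instancesByAsg.entrySet()) {
            List<String> instances = entry.getValue();
            if (!instances.isEmpty()) {
                String victim = instances.get(random.nextInt(instances.size()));
                victims.put(entry.getKey(), victim);
            }
        }
        return victims;
    }
}
```

Because the ASG replaces whatever is killed, a service that runs inside one should survive this selection; a service that runs outside one is exactly what the exercise is meant to expose.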
  46. Simian Army …
     Latency Monkey
     • Simulates soft failures, i.e. a service gets slower
     • Injects random delays in NIWS (client-side) or Server (server-side) of a client-server interaction
     • Tests the ability of applications to detect and recover (i.e. Graceful Degradation) from the harder problem of delays, which lead to thundering herds and timeouts
     Chaos Gorilla
     • Simulates Zone Outage
     Other Monkeys
     • Security Monkey
     • Janitor Monkey
     • Efficiency Monkey
     • … more
  47. Building Redundancy and Availability
     • Deploy in multiple zones and consider multiple regions
     • Pay attention to various modes of failure
  48. Three Balanced Availability Zones (courtesy @adrianco)
     [Diagram: Load Balancers in front of Zone A, Zone B and Zone C, each zone with its own Persistence Store]
  49. Triple Replicated Persistence
     [Diagram: Load Balancers in front of Zone A, Zone B and Zone C; the Persistence Store is replicated in each of the three zones]
  50. Isolated Regions
     [Diagram: US-East Load Balancers and EU-West Load Balancers, each in front of its own Zones A, B and C; the zones are backed by Persistence Stores / Cassandra Replicas]
  51. Cassandra Global Ring
     Reference: http://
  52. Tips & Guidelines
  53. Tips & Guidelines contd …
     • Amazon CloudWatch
       – Is your friend! Netflix Servo (http:// helps you publish metrics to CloudWatch
     • ELB
       – Always keep your Zones Balanced!
       – Healthcheck URLs are important
     • Auto Scaling Groups
       – This is an amazing feature that can really save you $$$s and help you run more efficiently. Read http://
  54. Tips & Guidelines contd …
     • Keep active track of Usage Costs
       – Usage costs can surprise you!
       – Netflix has an internal tool which we may open source. Watch @NetflixOSS
     • Reserve Instances
       – Reservation can save you $$$s (up to 71%!!) (YMMV)
       – Guarantees availability when you need it
  55. Tips/Guidelines
     • S3 Best Practices
       – Amazon doc: http://
       – Know when to use Regional S3 Endpoints
         • Important when your dev/test team and deployments are in different regions
       – Use Smart Bucket/Key naming
         • Use 3 to 63 characters.
         • Use only lower case letters (at least one), numbers, . and -.
         • Don't start or end the bucket name with . and don't follow or precede a . with a -.
       – Compress Data
       – Use TTLs
       – Many more …
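The bucket naming rules on this slide translate directly into a small validator. The sketch checks only the rules as stated on the slide, which may differ from Amazon's current official rules; `BucketNames.isValid` is a hypothetical helper, not part of any AWS SDK.

```java
/** Hypothetical validator for the bucket naming rules listed on the slide. */
final class BucketNames {
    static boolean isValid(String name) {
        // Rule: use 3 to 63 characters.
        if (name.length() < 3 || name.length() > 63) {
            return false;
        }
        // Rule: only lower case letters, numbers, '.' and '-'.
        if (!name.matches("[a-z0-9.-]+")) {
            return false;
        }
        // Rule: at least one lower case letter.
        boolean hasLetter = name.chars().anyMatch(c -> c >= 'a' && c <= 'z');
        if (!hasLetter) {
            return false;
        }
        // Rule: don't start or end with '.'.
        if (name.startsWith(".") || name.endsWith(".")) {
            return false;
        }
        // Rule: a '.' must not be followed or preceded by '-'.
        return !name.contains(".-") && !name.contains("-.");
    }
}
```

For example, `BucketNames.isValid("my-bucket.logs")` passes every rule, while `"My-Bucket"` fails on the upper case letters and `"my-.bucket"` fails on the `-.` sequence.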
  56. Open Source
  57. Open Source
     • @NetflixOSS
     • http://
     • Built for the CLOUD
  58. How can you benefit?
  59. Deployment Tool
     • ASGARD
  60. Configuration Management
     • Archaius (Properties Management)
     • More Coming Soon …
  61. NoSQL Persistence
     Cassandra based offerings
     • Priam (Token Management)
     • Astyanax (Cassandra Client)
     • JMeter plugin for Load Tests
  62. Technical Knowledge Sharing
     • http://
       – Cloud Usage
       – Personalization & Recommendations
       – Hadoop and Big Data papers
       – CDN (Content Delivery Networks)
       – General Architectural Guidelines
       – Performance & Scalability
     • Slideshare
       – http:// searchfrom=header&q=Netflix
  63. New Challenges
     • More Global Expansion
     • Real Time Data Infrastructure
     • March towards Continuous Integration and Deployment
  64. Netflix
     • Freedom and Responsibility
       – Empower engineers
       – #DevOps
       – Context not Control
  65. Want to Join us?
     http://
  66. Credits
     Adrian Cockcroft (@adrianco), Ruslan Meshenberg (@rusmeshenberg), Yury Izrailevsky, Joe Sondow (@joesondow), Ben Christensen (@benchristensen), Jordan Zimmerman (@randgalt), Ariel Tseitlin (@atseitlin), Allen Wang, Eran Landau, Danny Yuan, Pradeep Kamath
     And Members of the Netflix Cloud Platform Team
  67. Q & A
     • @stonse
  68. Amazon Cloud Terminology Reference (courtesy @adrianco)
     This is not a full list of Amazon Web Services features.
     • AWS – Amazon Web Services (common name for Amazon cloud)
     • AMI – Amazon Machine Image (archived boot disk, Linux, Windows etc. plus application code)
     • EC2 – Elastic Compute Cloud
       – Range of virtual machine types m1, m2, c1, cc, cg. Varying memory, CPU and disk configurations.
       – Instance – a running computer system. Ephemeral, when it is de-allocated nothing is kept.
       – Reserved Instances – pre-paid to reduce cost for long term usage
       – Availability Zone – datacenter with own power and cooling hosting cloud instances
       – Region – group of Availability Zones – US-East, US-West, EU-Eire, Asia-Singapore, Asia-Japan
     • ASG – Auto Scaling Group (instances booting from the same AMI)
     • S3 – Simple Storage Service (http access)
     • EBS – Elastic Block Storage (network disk filesystem can be mounted on an instance)
     • RDS – Relational Database Service (managed MySQL master and slaves)
     • SDB – Simple Data Base (hosted http based NoSQL data store)
     • SQS – Simple Queue Service (http based message queue)
     • SNS – Simple Notification Service (http and email based topics and messages)
     • EMR – Elastic Map Reduce (automatically managed Hadoop cluster)
     • ELB – Elastic Load Balancer
     • EIP – Elastic IP (stable IP address mapping assigned to instance or ELB)
     • VPC – Virtual Private Cloud (extension of enterprise datacenter network into cloud)
     • IAM – Identity and Access Management (fine grain role based security keys)