Netflix in the Cloud at SV Forum

Talk given at SVForum, Sunnyvale CA, March 27th 2012

Statistics

Views
  Total Views: 15,575
  Views on SlideShare: 13,801
  Embed Views: 1,774

Actions
  Likes: 44
  Downloads: 11
  Comments: 1

18 Embeds (1,774 views)

http://www.scoop.it 1360
http://www.initcron.org 235
http://www.newvem.com 76
http://feeds.feedburner.com 22
https://twimg0-a.akamaihd.net 16
https://twitter.com 15
http://newvem.staging.wpengine.com 13
http://localhost 10
http://www.thedevopsblog.com 6
https://si0.twimg.com 6
http://www.techgig.com 3
https://www.linkedin.com 2
https://www.google.com 2
http://pmomale-ld1 2
http://translate.googleusercontent.com 2
http://us-w1.rockmelt.com 2
http://www.linkedin.com 1
http://webcache.googleusercontent.com 1


Upload Details

Uploaded as Adobe PDF

Usage Rights

© All Rights Reserved


Netflix in the Cloud at SV Forum: Presentation Transcript

  • Cloud Architecture at Netflix: How Netflix Built a Scalable Java-oriented PaaS Running on AWS
    SVForum, March 27th, 2012
    Adrian Cockcroft
    @adrianco #netflixcloud
    http://www.linkedin.com/in/adriancockcroft
  • Adrian Cockcroft
    • Director, Architecture for Cloud Systems, Netflix Inc.
      – Previously Director for Personalization Platform
    • Distinguished Availability Engineer, eBay Inc. 2004-7
      – Founding member of eBay Research Labs
    • Distinguished Engineer, Sun Microsystems Inc. 1988-2004
      – 2003-4 Chief Architect High Performance Technical Computing
      – 2001 Author: Capacity Planning for Web Services
      – 1999 Author: Resource Management
      – 1995 & 1998 Author: Sun Performance and Tuning
      – 1996 Japanese Edition of Sun Performance and Tuning
        • SPARC & Solarisパフォーマンスチューニング (サンソフトプレスシリーズ)
    • More
      – Twitter @adrianco – Blog http://perfcap.blogspot.com
      – Presentations at http://www.slideshare.net/adrianco
  • Why Netflix, Why Cloud, Why AWS (Part 1 of 3)
  • What kind of Cloud?
    • Software as a Service – SaaS
      – Replaces in-house applications
      – Targets end users
    • Platform as a Service – PaaS
      – Replaces in-house operations functions
      – Targets developers
    • Infrastructure as a Service – IaaS
      – Replaces in-house datacenter capacity
      – Targets developers and IT ops
  • What Netflix Did
    • Moved to SaaS
      – Corporate IT – OneLogin, Workday, Box, Evernote…
      – Tools – Pagerduty, AppDynamics, Elastic MapReduce
    • Built our own PaaS <- today's focus
      – Customized to make our developers productive
      – When we started, we had little choice
    • Moved incremental capacity to IaaS
      – No new datacenter space since 2008 as we grew
      – Moved our streaming apps to the cloud
  • Why  Use  Public  Cloud?  
  • Things  We  Don’t  Do  
  • Better Business Agility
  • Data Center: Netflix could not build new datacenters fast enough
    Capacity growth is accelerating and unpredictable
    Product launch spikes – iPhone, Wii, PS3, Xbox
    International – Canada, Latin America, UK/Ireland
  • Netflix.com is now ~100% Cloud
    A few small back-end data sources still in progress
    All international product is cloud based
    USA-specific logistics remains in the datacenter
    Working on SOX, PCI as scope starts to include AWS
  • Netflix Choice was AWS, with our own platform and tools
    Unique platform requirements and extreme scale, agility and flexibility
  • Leverage AWS Scale: "the biggest public cloud"
    AWS investment in features and automation
    Use AWS zones and regions for high availability, scalability and global deployment
  • But isn't Amazon a competitor?
    Many products that compete with Amazon run on AWS
    We are a "poster child" for the AWS Architecture
    Netflix is one of the biggest AWS customers
    Co-opetition – competitors are also partners
  • Could Netflix use another cloud?
    It would be nice; we already use three interchangeable CDN vendors
    But no one else has the scale and features of AWS
    You have to be this tall to ride this ride…
    Maybe in 2-3 years?
  • We want to use clouds; we don't have time to build them
    Public cloud for agility and scale
    We use electricity too, but don't want to build our own power station…
    AWS because they are big enough to allocate thousands of instances per hour when we need to
  • What about other PaaS?
    • CloudFoundry – Open Source by VMWare
      – Developer-friendly, easy to get started
      – Missing scale and some enterprise features
    • Rightscale
      – Widely used to abstract away from AWS
      – Creates its own lock-in problem…
    • AWS is growing into this space
      – We didn't want a vendor between us and AWS
      – We wanted to build a thin PaaS that gets thinner
  • Netflix Deployed on AWS
    2009 Content: Video Masters, EC2, S3, CDNs
    2009 Logs: S3, EMR Hadoop, Hive, Business Intelligence
    2010 Play: DRM, CDN routing, Bookmarks, Logging
    2010 WWW: Sign-Up, Search, Movie Choosing, Ratings
    2010 API: Metadata, Device Config, TV Movie Choosing, Social Facebook
    2011 CS: International CS lookup, Diagnostics & Actions, Customer Call Log, CS Analytics
  • Cloud Architecture Patterns: Where do we start?
  • Goals
    • Faster
      – Lower latency than the equivalent datacenter web pages and API calls
      – Measured as mean and 99th percentile
      – For both first hit (e.g. home page) and in-session hits for the same user
    • Scalable
      – Avoid needing any more datacenter capacity as subscriber count increases
      – No central vertically scaled databases
      – Leverage AWS elastic capacity effectively
    • Available
      – Substantially higher robustness and availability than datacenter services
      – Leverage multiple AWS availability zones
      – No scheduled downtime, no central database schema to change
    • Productive
      – Optimize agility of a large development team with automation and tools
      – Leave behind complex tangled datacenter code base (~8 year old architecture)
      – Enforce clean layered interfaces and re-usable components
  • Datacenter Anti-Patterns: What do we currently do in the datacenter that prevents us from meeting our goals?
  • Rewrite from Scratch
    Not everything is cloud specific
    Pay down technical debt
    Robust patterns
  • Netflix Datacenter vs. Cloud Architecture
    Central SQL Database -> Distributed Key/Value NoSQL
    Sticky In-Memory Session -> Shared Memcached Session
    Chatty Protocols -> Latency Tolerant Protocols
    Tangled Service Interfaces -> Layered Service Interfaces
    Instrumented Code -> Instrumented Service Patterns
    Fat Complex Objects -> Lightweight Serializable Objects
    Components as Jar Files -> Components as Services
  • Software Architecture Patterns
    • Object Models
      – Basic and derived types, facets, serializable
      – Pass by reference within a service
      – Pass by value between services
    • Computation and I/O Models (see the sketch below)
      – Service execution using Best Effort / Futures
      – Common thread pool management
      – Circuit breakers to manage and contain failures
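    To make the computation model concrete, here is a minimal sketch (illustrative, not Netflix platform code) of best-effort service execution: the call runs as a Future on a shared thread pool, the caller waits with a short timeout, and degrades to a default value on failure. The pool size, timeout and service names are assumptions.

    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;

    public class BestEffortExample {
        // Shared, bounded thread pool for calls to one dependency (size is illustrative)
        private static final ExecutorService POOL = Executors.newFixedThreadPool(20);

        // Best effort: wait briefly for the dependency, then degrade gracefully
        static String lookupTitle(final int movieId) {
            Future<String> future = POOL.submit(new Callable<String>() {
                public String call() throws Exception {
                    return remoteTitleService(movieId); // the real network call
                }
            });
            try {
                return future.get(50, TimeUnit.MILLISECONDS); // fail fast
            } catch (Exception e) {
                future.cancel(true);      // free the worker thread
                return "Popular Titles";  // fallback default
            }
        }

        private static String remoteTitleService(int id) { return "Title-" + id; } // stub

        public static void main(String[] args) {
            System.out.println(lookupTitle(42));
            POOL.shutdown();
        }
    }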
  • Model Driven Architecture
    • Traditional Datacenter Practices
      – Lots of unique hand-tweaked systems
      – Hard to enforce patterns
      – Some use of Puppet to automate changes
    • Model Driven Cloud Architecture (see the sketch below)
      – Perforce/Ivy/Jenkins based builds for everything
      – Every production instance is a pre-baked AMI
      – Every application is managed by an Autoscaler
    Every change is a new AMI
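    A hedged sketch of the "every change is a new AMI" deployment step using the AWS SDK for Java: register a launch configuration pointing at the freshly baked AMI, then create the Auto Scaling group that runs it. Names, sizes, zones and credentials are hypothetical, and this is not the Netflix Application Console code.

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.autoscaling.AmazonAutoScalingClient;
    import com.amazonaws.services.autoscaling.model.CreateAutoScalingGroupRequest;
    import com.amazonaws.services.autoscaling.model.CreateLaunchConfigurationRequest;

    public class DeployAmi {
        public static void main(String[] args) {
            AmazonAutoScalingClient asg = new AmazonAutoScalingClient(
                    new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY")); // placeholder creds

            // Each code change is baked into a new AMI; point a new launch config at it
            asg.createLaunchConfiguration(new CreateLaunchConfigurationRequest()
                    .withLaunchConfigurationName("myapp-v124")  // hypothetical version name
                    .withImageId("ami-12345678")                // the pre-baked AMI
                    .withInstanceType("m1.large")
                    .withSecurityGroups("myapp"));              // one security group per app

            // Every application runs under an autoscaler, even if min == max
            asg.createAutoScalingGroup(new CreateAutoScalingGroupRequest()
                    .withAutoScalingGroupName("myapp-v124")
                    .withLaunchConfigurationName("myapp-v124")
                    .withAvailabilityZones("us-east-1a", "us-east-1c", "us-east-1d")
                    .withMinSize(3).withMaxSize(12).withDesiredCapacity(3));
        }
    }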
  • Netflix PaaS Principles
    • Maximum Functionality
      – Developer productivity and agility
    • Leverage as much of AWS as possible
      – AWS is making huge investments in features/scale
    • Interfaces that isolate Apps from AWS
      – Avoid lock-in to specific AWS API details
    • Portability is a long term goal
      – Gets easier as other vendors catch up with AWS
  • Netflix Global PaaS
    • Architecture Features and Overview
    • Portals and Explorers
    • Platform Services
    • Platform APIs
    • Platform Frameworks
    • Persistence
    • Scalability Benchmark
  • Global PaaS? Toys are nice, but this is the real thing…
    • Supports all AWS Availability Zones and Regions
    • Supports multiple AWS accounts {test, prod, etc.}
    • Cross Region/Acct Data Replication and Archiving
    • Internationalized, Localized and GeoIP routing
    • Security is fine grain, dynamic AWS keys
    • Autoscaling to thousands of instances
    • Monitoring for millions of metrics
    • Productive for 100s of developers on one product
    • 23M+ users: USA, Canada, Latin America, UK, Eire
  • Basic PaaS Entities
    • AWS Based Entities
      – Instances and Machine Images, Elastic IP Addresses
      – Security Groups, Load Balancers, Autoscale Groups
      – Availability Zones and Geographic Regions
    • Netflix PaaS Entities
      – Applications (registered services)
      – Clusters (versioned Autoscale Groups for an App)
      – Properties (dynamic hierarchical configuration)
  • Core PaaS Services
    • AWS Based Services
      – S3 storage, up to 5TB files, parallel multipart writes (see the sketch below)
      – SQS – Simple Queue Service. Messaging layer.
    • Netflix Based Services
      – EVCache – memcached based ephemeral cache
      – Cassandra – distributed data store
    • External Services
      – GeoIP lookup interfaced to a vendor
      – Keystore HSM in Netflix Datacenter
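    For the large-file S3 path, a minimal sketch using the AWS SDK for Java's TransferManager, which splits big objects into parts and uploads them in parallel. The bucket, key and file names are hypothetical; credentials are placeholders.

    import java.io.File;
    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.s3.transfer.TransferManager;
    import com.amazonaws.services.s3.transfer.Upload;

    public class LargeUpload {
        public static void main(String[] args) throws Exception {
            // TransferManager performs parallel multipart uploads for large files
            TransferManager tm = new TransferManager(
                    new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));
            Upload upload = tm.upload("my-video-bucket", "masters/title-42.mpg",
                    new File("/data/title-42.mpg"));
            upload.waitForCompletion(); // blocks until all parts are done
            tm.shutdownNow();
        }
    }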
  • Instance Architecture
    Linux Base AMI (CentOS or Ubuntu)
    • Optional Apache frontend, memcached, non-java apps
    • Java (JDK 6 or 7) and Tomcat: application servlet, base server, platform and interface jars for dependent services
    • AppDynamics appagent monitoring, plus AppDynamics machineagent
    • Log rotation to S3; healthcheck and status servlets; JMX interface; GC and thread dump logging
    • Servo autoscale; Epic
  • Security Architecture
    • Instance Level Security baked into base AMI
      – Login: ssh only allowed via portal (not between instances)
      – Each app type runs as its own userid app{test|prod}
    • AWS Security, Identity and Access Management
      – Each app has its own security group (firewall ports)
      – Fine grain user roles and resource ACLs
    • Key Management
      – AWS keys dynamically provisioned, easy updates
      – High grade app specific key management support
  • Portals and Explorers
    • Netflix Application Console (NAC)
      – Primary AWS provisioning/config interface
    • AWS Usage Analyzer
      – Breaks down costs by application and resource
    • Cassandra Explorer
      – Browse clusters, keyspaces, column families
    • Base Server Explorer
      – Browse service endpoints configuration, perf
  • Platform Services
    • Discovery – service registry for "Applications"
    • Introspection – Entrypoints
    • Cryptex – Dynamic security key management
    • Geo – Geographic IP lookup
    • Platformservice – Dynamic property configuration
    • Localization – manage and lookup local translations
    • Evcache – ephemeral volatile cache
    • Cassandra – Cross zone/region distributed data store
    • Zookeeper – Distributed Coordination (Curator)
    • Various proxies – access to old datacenter stuff
  • Metrics Framework
    • System and Application
      – Collection, Aggregation, Querying and Reporting
      – Non-blocking logging, avoids log4j lock contention
      – Honu Streaming -> S3 -> EMR -> Hive
    • Performance, Robustness, Monitoring, Analysis
      – Tracers, Counters – explicit code instrumentation log
      – Real Time Tracers/Counters
      – SLA – service level response time percentiles
      – Servo annotated JMX extract to Cloudwatch (see the sketch below)
    • Latency Monkey Infrastructure
      – Inject random delays into service responses
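    A sketch of the Servo-annotated metrics mentioned above, based on the open-sourced Servo library: annotated fields are registered once and exposed via JMX, where pollers can pick them up and forward them to CloudWatch. Class and metric names are illustrative assumptions.

    import java.util.concurrent.atomic.AtomicInteger;
    import com.netflix.servo.annotations.DataSourceType;
    import com.netflix.servo.annotations.Monitor;
    import com.netflix.servo.monitor.Monitors;

    public class RequestStats {
        // Monotonic counter exposed via JMX; rate is derived by the poller
        @Monitor(name = "requestCount", type = DataSourceType.COUNTER)
        private final AtomicInteger requestCount = new AtomicInteger(0);

        // Point-in-time gauge, sampled as-is on each poll
        @Monitor(name = "activeThreads", type = DataSourceType.GAUGE)
        private final AtomicInteger activeThreads = new AtomicInteger(0);

        public RequestStats() {
            Monitors.registerObject(this); // register all annotated fields
        }

        public void onRequestStart() { requestCount.incrementAndGet(); activeThreads.incrementAndGet(); }
        public void onRequestEnd()   { activeThreads.decrementAndGet(); }
    }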
  • Netflix Platform Persistence
    • Ephemeral Volatile Cache – evcache
      – Discovery-aware memcached based backend
      – Client abstractions for zone aware replication
      – Option to write to all zones, fast read from local
    • Cassandra
      – Highly available and scalable (more later…)
    • MongoDB
      – Complex object/query model for small scale use
    • MySQL
      – Hard to scale, legacy and small relational models
  • Priam – Cassandra Automation
    Available at http://github.com/netflix
    • Netflix Platform Tomcat Code
    • Zero touch auto-configuration
    • State management for Cassandra JVM
    • Token allocation and assignment
    • Broken node auto-replacement
    • Full and incremental backup to S3
    • Restore sequencing from S3
    • Grow/Shrink Cassandra "ring"
  • Astyanax
    Available at http://github.com/netflix
    • Cassandra java client
    • API abstraction on top of Thrift protocol
    • "Fixed" Connection Pool abstraction (vs. Hector)
      – Round robin with Failover
      – Retry-able operations not tied to a connection
      – Netflix PaaS Discovery service integration
      – Host reconnect (fixed interval or exponential backoff)
      – Token aware to save a network hop – lower latency
      – Latency aware to avoid compacting/repairing nodes – lower variance
    • Batch mutation: set, put, delete, increment
    • Simplified use of serializers via method overloading (vs. Hector)
    • ConnectionPoolMonitor interface for counters and tracers
    • Composite Column Names replacing deprecated SuperColumns
  • Astyanax Query Example: paginate through all columns in a row

    import com.netflix.astyanax.connectionpool.exceptions.ConnectionException;
    import com.netflix.astyanax.model.Column;
    import com.netflix.astyanax.model.ColumnList;
    import com.netflix.astyanax.query.RowQuery;
    import com.netflix.astyanax.util.RangeBuilder;

    ColumnList<String> columns;
    int pagesize = 10;
    try {
        RowQuery<String, String> query = keyspace
                .prepareQuery(CF_STANDARD1)
                .getKey("A")
                .setIsPaginating()  // each execute() returns the next page
                .withColumnRange(new RangeBuilder().setMaxSize(pagesize).build());
        while (!(columns = query.execute().getResult()).isEmpty()) {
            for (Column<String> c : columns) {
                // process each column here
            }
        }
    } catch (ConnectionException e) {
        // handle or rethrow; retries and failover are managed by the connection pool
    }
  • High Availability
    • Cassandra stores 3 local copies, 1 per zone (see the consistency sketch below)
      – Synchronous access, durable, highly available
      – Read/Write One: fastest, least consistent - ~1ms
      – Read/Write Quorum: 2 of 3, consistent - ~3ms
    • AWS Availability Zones
      – Separate buildings
      – Separate power etc.
      – Fairly close together
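    The one-vs-quorum trade-off maps directly onto per-operation consistency levels in the Astyanax client. A hedged sketch, reusing the keyspace and CF_STANDARD1 objects from the query example above; row and column names are hypothetical.

    import com.netflix.astyanax.Keyspace;
    import com.netflix.astyanax.MutationBatch;
    import com.netflix.astyanax.connectionpool.exceptions.ConnectionException;
    import com.netflix.astyanax.model.ConsistencyLevel;

    public class ConsistencyExample {
        // Fastest, least consistent: ack from one replica (~1ms in-region)
        static void writeOne(Keyspace keyspace) throws ConnectionException {
            MutationBatch m = keyspace.prepareMutationBatch()
                    .setConsistencyLevel(ConsistencyLevel.CL_ONE);
            m.withRow(CF_STANDARD1, "A").putColumn("col", "value", null);
            m.execute();
        }

        // Consistent: 2 of 3 replicas must ack (~3ms in-region)
        static String readQuorum(Keyspace keyspace) throws ConnectionException {
            return keyspace.prepareQuery(CF_STANDARD1)
                    .setConsistencyLevel(ConsistencyLevel.CL_QUORUM)
                    .getKey("A").getColumn("col")
                    .execute().getResult().getStringValue();
        }
    }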
  • "Traditional" Cassandra Write Data Flows
    Single Region, Multiple Availability Zone, Not Token Aware
    1. Client writes to any Cassandra node
    2. Coordinator node replicates to nodes and zones
    3. Nodes return ack to coordinator
    4. Coordinator returns ack to client
    5. Data written to internal commit log disk (no more than 10 seconds later)
    [Diagram: non-token-aware clients writing to Cassandra nodes with local disks spread across Zones A, B and C]
    If a node goes offline, hinted handoff completes the write when the node comes back up.
    Requests can choose to wait for one node, a quorum, or all nodes to ack the write.
    SSTable disk writes and compactions occur asynchronously.
  • Astyanax - Cassandra Write Data Flows
    Single Region, Multiple Availability Zone, Token Aware
    1. Client writes directly to the replica nodes and zones
    2. Nodes return ack to client
    3. Data written to internal commit log disks (no more than 10 seconds later)
    [Diagram: token-aware clients writing to Cassandra nodes with local disks spread across Zones A, B and C]
    If a node goes offline, hinted handoff completes the write when the node comes back up.
    Requests can choose to wait for one node, a quorum, or all nodes to ack the write.
    SSTable disk writes and compactions occur asynchronously.
  • Data Flows for Multi-Region Writes
    Token Aware, Consistency Level = Local Quorum
    1. Client writes to local replicas
    2. Local write acks returned to client, which continues when 2 of 3 local nodes are committed
    3. Local coordinator writes to remote coordinator
    4. When data arrives, remote coordinator node acks and copies to other remote zones
    5. Remote nodes ack to local coordinator
    6. Data flushed to internal commit log disks (no more than 10 seconds later)
    [Diagram: US and EU clients, each region with Cassandra nodes across three zones, 100+ms latency between regions]
    If a node or region goes offline, hinted handoff completes the write when the node comes back up.
    Nightly global compare and repair jobs ensure everything stays consistent.
  • Cassandra Backup
    • Full Backup
      – Time based snapshot
      – SSTable compress -> S3
    • Incremental
      – SSTable write triggers compressed copy to S3
    • Archive
      – Copy cross region
    [Diagram: Cassandra ring backing up to S3]
  • ETL for Cassandra
    • Data is de-normalized over many clusters!
    • Too many to restore from backups for ETL
    • Solution – read backup files using Hadoop
    • Aegisthus
      – http://techblog.netflix.com/2012/02/aegisthus-bulk-data-pipeline-out-of.html
      – High throughput raw SSTable processing
      – Re-normalizes many clusters to a consistent view
      – Extract, Transform, then Load into Teradata
  • Cassandra Archive
    Appropriate level of paranoia needed…
    • Archive could be un-readable
      – Restore S3 backups weekly from prod to test, and daily ETL
    • Archive could be stolen
      – PGP Encrypt archive
    • AWS East Region could have a problem
      – Copy data to AWS West
    • Production AWS Account could have an issue
      – Separate Archive account with no-delete S3 ACL
    • AWS S3 could have a global problem
      – Create an extra copy on a different cloud vendor…
  • Tools and Automation
    • Developer and Build Tools
      – Jira, Perforce, Eclipse, Jenkins, Ivy, Artifactory
      – Builds, creates .war file, .rpm, bakes AMI and launches
    • Custom Netflix Application Console
      – AWS Features at Enterprise Scale (hide the AWS security keys!)
      – Auto Scaler Group is unit of deployment to production
    • Open Source + Support
      – Apache, Tomcat, Cassandra, Hadoop
      – Datastax support for Cassandra, AWS support for Hadoop via EMR
    • Monitoring Tools
      – Alert processing gateway into Pagerduty
      – AppDynamics – Developer focus for cloud http://appdynamics.com
  • Open Source Strategy
    • Release PaaS Components git-by-git
      – Source at github.com/netflix
      – Intros and techniques at techblog.netflix.com
      – Blog post or new code every week or so
    • Motivations
      – Give back to Apache licensed OSS community
      – Motivate, retain, hire top engineers
      – Create a community that adds features and fixes
  • Current OSS Projects and Posts (Github / Techblog)
    Priam, Exhibitor, Servo, Astyanax, Curator, Autoscaling scripts, CassJMeter, Honu, EVCache, Circuit Breaker, Aegisthus
    Related Apache projects: Zookeeper, Cassandra
  • Scalability Testing
    • Cloud Based Testing – frictionless, elastic
      – Create/destroy any sized cluster in minutes
      – Many test scenarios run in parallel
    • Test Scenarios
      – Internal app specific tests
      – Simple "stress" tool provided with Cassandra
    • Scale test, keep making the cluster bigger
      – Check that tooling and automation works…
      – How many ten column row writes/sec can we do?
  • <DrEvil>ONE  MILLION</DrEvil>  
  • Scale-Up Linearity
    http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
    Client Writes/s by node count – Replication Factor = 3
      48 nodes: 174,373 writes/s
      96 nodes: 366,828 writes/s
      144 nodes: 537,172 writes/s
      288 nodes: 1,099,837 writes/s
  • Availability  and  Resilience  
  • Chaos Monkey (toy sketch below)
    • Computers (Datacenter or AWS) randomly die
      – Fact of life, but too infrequent to test resiliency
    • Test to make sure systems are resilient
      – Allow any instance to fail without customer impact
    • Chaos Monkey hours
      – Monday-Thursday 9am-3pm random instance kill
    • Application configuration option
      – Apps now have to opt out from Chaos Monkey
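    A toy sketch of the idea only; the real Chaos Monkey adds scheduling, opt-out configuration and auditing. During the stated business hours, pick one random instance from a group's membership and terminate it via the AWS SDK for Java. The class and method names are hypothetical.

    import java.util.Calendar;
    import java.util.List;
    import java.util.Random;
    import com.amazonaws.services.ec2.AmazonEC2Client;
    import com.amazonaws.services.ec2.model.TerminateInstancesRequest;

    public class ToyChaosMonkey {
        // Kill at most one instance, and only when engineers are around to respond
        static void maybeKillOne(AmazonEC2Client ec2, List<String> groupInstanceIds) {
            Calendar now = Calendar.getInstance();
            int day = now.get(Calendar.DAY_OF_WEEK);
            int hour = now.get(Calendar.HOUR_OF_DAY);
            // Monday-Thursday, 9am-3pm only
            if (day < Calendar.MONDAY || day > Calendar.THURSDAY) return;
            if (hour < 9 || hour >= 15) return;
            if (groupInstanceIds.isEmpty()) return;

            String victim = groupInstanceIds.get(
                    new Random().nextInt(groupInstanceIds.size()));
            ec2.terminateInstances(
                    new TerminateInstancesRequest().withInstanceIds(victim));
        }
    }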
  • Responsibility and Experience
    • Make developers responsible for failures
      – Then they learn and write code that doesn't fail
    • Use Incident Reviews to find gaps to fix
      – Make sure it's not about finding "who to blame"
    • Keep timeouts short, fail fast
      – Don't let cascading timeouts stack up
    • Make configuration options dynamic
      – You don't want to push code to tweak an option
  • Resilient Design – Circuit Breakers
    http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
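    The techblog post describes Netflix's dependency command pattern; the following is a generic, minimal circuit breaker sketch and not the Netflix implementation. After a run of failures the breaker opens and calls return the fallback immediately; after a cooldown, one trial call is let through to test recovery. Thresholds are illustrative.

    import java.util.concurrent.Callable;
    import java.util.concurrent.atomic.AtomicInteger;
    import java.util.concurrent.atomic.AtomicLong;

    public class CircuitBreaker {
        private final int failureThreshold = 5;     // consecutive failures before tripping
        private final long cooldownMillis = 10000;  // how long to stay open
        private final AtomicInteger failures = new AtomicInteger(0);
        private final AtomicLong openedAt = new AtomicLong(0);

        public <T> T call(Callable<T> request, T fallback) {
            long opened = openedAt.get();
            if (opened != 0 && System.currentTimeMillis() - opened < cooldownMillis) {
                return fallback; // open: fail fast, don't let cascading timeouts stack up
            }
            try {
                T result = request.call(); // closed, or a half-open trial after cooldown
                failures.set(0);           // success resets the breaker
                openedAt.set(0);
                return result;
            } catch (Exception e) {
                if (failures.incrementAndGet() >= failureThreshold) {
                    openedAt.set(System.currentTimeMillis()); // trip the breaker
                }
                return fallback;
            }
        }
    }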
  • PaaS Operational Model
    • Developers
      – Provision and run their own code in production
      – Take turns to be on call if it breaks (pagerduty)
      – Configure autoscalers to handle capacity needs
    • DevOps and PaaS (aka NoOps)
      – DevOps is used to build and run the PaaS
      – PaaS constrains Dev to use automation instead
      – PaaS puts more responsibility on Dev, with tools
  • What's Left for Corp IT?
    • Corporate Security and Network Management
      – Billing and remnants of streaming service back-ends in DC
    • Running Netflix' DVD Business
      – Tens of Oracle instances
      – Hundreds of MySQL instances
      – Thousands of VMWare VMs
      – Zabbix, Cacti, Splunk, Puppet
    • Employee Productivity
      – Building networks and WiFi [chart: Corp WiFi performance]
      – SaaS OneLogin SSO Portal
      – Evernote Premium, Safari Online Bookshelf, Dropbox for Teams
      – Google Enterprise Apps, Workday HCM/Expense, Box.com
      – Many more SaaS migrations coming…
  • Implications for IT Operations
    • Cloud is run by the developer organization
      – Product group's "IT department" is the AWS API and PaaS
      – CorpIT handles billing and some security functions
    • Cloud capacity is 10x bigger than Datacenter
      – Datacenter oriented IT didn't scale up as we grew
      – We moved a few people out of IT to do DevOps for our PaaS
    • Traditional IT Roles and Silos are going away
      – We don't have SA, DBA, Storage, Network admins for cloud
      – Developers deploy and "run what they wrote" in production
  • Netflix PaaS Organization
    A developer org, reporting into Product Development, not ITops.
    [Org chart] Netflix Cloud Platform Team groups: Cloud Ops Reliability Engineering; Build Tools and Automation; Platform and Database Engineering; Cloud Solutions; Cloud Performance Architecture
    Responsibilities across the teams include: Perforce, Jenkins, Artifactory, JIRA; platform jars; Cassandra; Zookeeper; monitoring; alert routing; key store; security architecture; monkeys; incident lifecycle; Base AMI and Bakery; JVM GC tuning; benchmarking; Netflix App Console; Wiresharking; Entrypoints; AWS VPC; PagerDuty; Hyperguard; AWS API; AWS instances; efficiency and future planning; Powerpoint ☺
  • Roadmap for 2012
    • Readiness for global Netflix launches
    • More resiliency and improved availability
    • More automation, orchestration
    • "Hardening" the platform
    • Lower latency for web services and devices
    • Working towards IPv6 support
    • More open sourced components
  • Wrap Up
    Answer your remaining questions…
    What was missing that you wanted to cover?
    Next up – Jason Chan on Security Architecture
  • Takeaway
    Netflix has built and deployed a scalable global Platform as a Service.
    Key components of the Netflix PaaS are being released as Open Source projects so you can build your own custom PaaS.
    http://github.com/Netflix
    http://techblog.netflix.com
    http://slideshare.net/Netflix
    http://www.linkedin.com/in/adriancockcroft
    @adrianco #netflixcloud