The Global Netflix Platform
A Large Scale Java oriented PaaS running on AWS

October 24th, 2011
Adrian Cockcroft
@adrianco #netflixcloud
http://www.linkedin.com/in/adriancockcroft
  
Netflix Inc.

With more than 20 million streaming members in the United States, Canada and Latin America, Netflix, Inc. is the world's leading Internet subscription service for enjoying movies and TV shows.

International Expansion
Netflix, Inc., the leading global Internet movie subscription service… announced it will expand to the United Kingdom and Ireland in early 2012.

Source: http://ir.netflix.com
  
The Global Netflix Platform

Netflix Cloud Migration
Netflix Platform Services and Interfaces
Highly Available and Globally Distributed Data
Scalability and Performance
  
Why Use Public Cloud?

Things We Don’t Do

Better Business Agility
  
Data Center
Netflix could not build new datacenters fast enough

Capacity growth is accelerating, unpredictable
Product launch spikes - iPhone, Wii, PS3, XBox
  
Out-Growing Data Center
http://techblog.netflix.com/2011/02/redesigning-netflix-api.html

37x Growth Jan 2010-Jan 2011
[Chart label: Datacenter Capacity]
  
Netflix.com is now ~100% Cloud

A few small back end data sources still in progress
All international product is cloud based
USA specific logistics remains in the Datacenter
Working aggressively on billing, PCI compliance on AWS
  
Netflix Choice was AWS with our own platform and tools

Unique platform requirements and extreme scale, agility and flexibility

Leverage AWS Scale
“the biggest public cloud”
AWS investment in features and automation
Use AWS zones and regions for high availability, scalability and global deployment
  
But isn’t Amazon a competitor?

Many products that compete with Amazon run on AWS
We are a “poster child” for the AWS Architecture
Netflix is one of the biggest AWS customers
Strategy – turn competitors into partners
  
Could Netflix use another cloud?

Would be nice, we use three interchangeable CDN Vendors
But no-one else has the scale and features of AWS
You have to be this tall to ride this ride…
Maybe in 2-3 years?
  
We want to use clouds, we don’t have time to build them

Public cloud for agility and scale
We use electricity too, but don’t want to build our own power station…
AWS because they are big enough to allocate thousands of instances per hour when we need to
  
Netflix Deployed on AWS

Content: Video Masters, EC2, S3, CDNs
Logs: S3, EMR Hadoop, Hive, Business Intelligence
Play: DRM, CDN routing, Bookmarks, Logging
WWW: Sign-Up, Search, Movie Choosing, Ratings
API: Metadata, Device Config, TV Movie Choosing, Social Facebook
CS: International CS lookup, Diagnostics & Actions, Customer Call Log, CS Analytics
  
Amazon Cloud Terminology Reference
     See http://aws.amazon.com/ This is not a full list of Amazon Web Service features

•    AWS – Amazon Web Services (common name for Amazon cloud)
•    AMI – Amazon Machine Image (archived boot disk, Linux, Windows etc. plus application code)
•    EC2 – Elastic Compute Cloud
       –    Range of virtual machine types m1, m2, c1, cc, cg. Varying memory, CPU and disk configurations.
       –    Instance – a running computer system. Ephemeral, when it is de-allocated nothing is kept.
       –    Reserved Instances – pre-paid to reduce cost for long term usage
       –    Availability Zone – datacenter with own power and cooling hosting cloud instances
       –    Region – group of Availability Zones – US-East, US-West, EU-Eire, Asia-Singapore, Asia-Japan, US-Gov
•    ASG – Auto Scaling Group (instances booting from the same AMI)
•    S3 – Simple Storage Service (http access)
•    EBS – Elastic Block Store (network disk filesystem can be mounted on an instance)
•    RDS – Relational Database Service (managed MySQL master and slaves)
•    SDB – Simple Data Base (hosted http based NoSQL data store)
•    SQS – Simple Queue Service (http based message queue)
•    SNS – Simple Notification Service (http and email based topics and messages)
•    EMR – Elastic Map Reduce (automatically managed Hadoop cluster)
•    ELB – Elastic Load Balancer
•    EIP – Elastic IP (stable IP address mapping assigned to instance or ELB)
•    VPC – Virtual Private Cloud (extension of enterprise datacenter network into cloud)
•    IAM – Identity and Access Management (fine grain role based security keys)
  
Boot Camp

•  One day “Netflix Cloud Training” class
    –  Has been run 5 times for 20-45 people each time
•  Half day of presentations
•  Half day hands-on
    –  Create your own hello world app
    –  Launch in AWS test account
    –  Login to your cloud instances
    –  Find monitoring data on your cloud instances
    –  Connect to Cassandra and read/write data
  
Netflix Built a PaaS!

•  Netflix Cloud Systems team (50+ rock-stars :)
    –  VP Cloud Systems (Yury Izrailevsky)
    –  Site Reliability Engineering (@jedberg) Hiring++!
    –  Cloud Performance (Denis Sheahan)
    –  Database Engineering - Cassandra+MySQL (@r39132)
    –  Platform Engineering – Astyanax (Eran Landau)
    –  Cloud Tools Engineering – Jenkins (@cquinn)
    –  Cloud Solutions Team – Monkeys (@atseitlin)
    –  Security (Jason Chan)
    –  Architecture (@adrianco)
  
Netflix Global PaaS

•    Architecture Features and Overview
•    Portals and Explorers
•    Platform Services
•    Platform APIs
•    Platform Frameworks
•    Persistence
•    Scalability Benchmark
  
Global PaaS?
Toys are nice, but this is the real thing…

•    Supports all AWS Availability Zones and Regions
•    Supports multiple AWS accounts {test, prod, etc.}
•    Cross Region/Acct Data Replication and Archiving
•    Internationalized, Localized and GeoIP routing
•    Security is fine grain, dynamic AWS keys
•    Autoscaling to thousands of instances
•    Monitoring for millions of metrics
•    20M+ users USA, Canada, Latin America (UK, Eire)
  
Instance Architecture

Linux Base AMI (currently Centos 5)
•  Optional Apache frontend, memcached, non-java apps
•  Monitoring
    –  Log rotation to S3
    –  AppDynamics machineagent
    –  Epic
•  Java (choice of JDK 6 or 7)
    –  AppDynamics appagent monitoring
    –  GC and thread dump logging
•  Tomcat
    –  Application servlet, base server, platform, interface jars for dependent services
    –  Healthcheck, status servlets, JMX interface
  	
  
Security Architecture

•  Instance Level Security baked into base AMI
    –  Login via ssh only allowed via portal
    –  Each app type runs as its own userid app{test|prod}
•  AWS Security, Identity and Access Management
    –  Each app has its own security group (firewall ports)
    –  Fine grain user roles and resource ACLs
•  Key Management
    –  AWS Keys dynamically provisioned, easy updates
    –  High grade app key management support
  
Core Platform Frameworks and APIs
  
Portals and Explorers

•  Netflix Application Console (NAC)
    –  Primary AWS provisioning/config interface
•  AWS Usage Analyzer
    –  Breaks down costs by application and resource
•  SimpleDB Explorer
    –  Browse domains, items, attributes, values
•  Cassandra Explorer
    –  Browse clusters, keyspaces, column families
•  Base Server Explorer
    –  Browse service endpoints configuration, perf
  
AWS Usage
for test, carefully omitting any $ numbers…

Cassandra Explorer
  
  
Platform Services

•    Discovery – service registry for “applications”
•    Introspection – Entrypoints
•    Cryptex – Dynamic security key management
•    Geo – Geographic IP lookup
•    Platformservice – Dynamic property configuration
•    Localization – manage and lookup local translations
•    Evcache – eccentric volatile (mem)cached
•    Cassandra – Persistence
•    Zookeeper - Coordination
•    Various proxies – access to old datacenter stuff
  
Introspection - Entrypoints

•  REST API for tools, apps, explorers, monkeys…
   –  E.g. GET /REST/v1/instance/$INSTANCE_ID

•  AWS Resources
   –  Autoscaling Groups, EIP Groups, Instances

•  Netflix PaaS Resources
   –  Discovery Applications, Clusters of ASGs, History
  
Entrypoints Queries
MongoDB is good for low traffic complex queries against complex objects

Description                                                               Range expression
Find all active instances.                                                all()
Find all instances associated with a group name.                          %(cloudmonkey)
Find all instances associated with a discovery group.                     /^cloudmonkey$/discovery()
Find all auto scale groups with no instances.                             asg(),-has(INSTANCES;asg())
How many instances are not in an auto scale group?                        count(all(),-info(eval(INSTANCES;asg())))
What groups include an instance?                                          *(i-4e108521)
What auto scale groups and elastic load balancers include an instance?    filter(TYPE;asg,elb;*(i-4e108521))
What instance has a given public ip?                                      filter(PUBLIC_IP;174.129.188.{0..255};all())
  
Metrics Framework

•  System and Application
    –  Collection, Aggregation, Querying and Reporting
    –  Non-blocking logging, avoids log4j lock contention
    –  Chukwa -> S3 -> EMR -> Hive
•  Performance, Robustness, Monitoring, Analysis
    –  Tracers, Counters – explicit code instrumentation log
    –  Real Time Tracers/Counters
    –  SLA – service level response time percentiles
    –  Epic (@MonitoredResources) annotated JMX extract
•  Latency Monkey Infrastructure
    –  Inject random delays into service responses
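
The "service level response time percentiles" item can be illustrated with a small calculation. This is a minimal sketch using the nearest-rank method, not the Netflix metrics code; the class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper: nearest-rank percentile over a batch of response times.
final class ResponseTimePercentiles {

    /** Returns the smallest sample such that at least `percent`% of samples are at or below it. */
    static long percentile(long[] samplesMillis, int percent) {
        if (samplesMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (percent < 1 || percent > 100) {
            throw new IllegalArgumentException("percent must be 1..100");
        }
        long[] sorted = samplesMillis.clone();
        Arrays.sort(sorted);
        // Integer form of ceil(percent / 100 * n); the rank is 1-based.
        int rank = (percent * sorted.length + 99) / 100;
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        long[] samples = {5, 1, 9, 3, 7, 2, 8, 4, 10, 6};
        System.out.println("p50=" + percentile(samples, 50) + "ms");
        System.out.println("p90=" + percentile(samples, 90) + "ms");
    }
}
```

Integer arithmetic is used for the rank so that the result does not depend on floating point rounding.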
  
Configuration Management

•  NetflixConfiguration
     –  Validation Framework
     –  Sitewide Properties Explorer
•    PlatformService
•    Mapping Service
•    ZooKeeper (Curator)
•    InstanceIdentity
  
Interprocess Communication

•  Discovery Service registry for “applications”
    –  “here I am” call every 30s, drop after 3 missed
    –  “where is everyone” call
    –  Redundant, distributed, moving to Zookeeper
•  NIWS – Netflix Internal Web Service client
    –  Software Middle Tier Load Balancer
    –  Failure retry moves to next instance
    –  Many options for encoding, etc.
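
The "here I am" and "where is everyone" calls describe a heartbeat registry. The sketch below shows one way such a registry could behave, using the 30s interval and 3 missed heartbeats from the slide. It is an in-memory illustration with hypothetical names, not the Netflix Discovery service, and it treats an instance as dropped once more than three intervals pass without a heartbeat (an assumption about how "3 missed" is counted).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory registry; time is passed in explicitly so it is easy to test.
final class HeartbeatRegistry {
    static final long HEARTBEAT_MILLIS = 30_000L; // "here I am" every 30s (from the slide)
    static final int MISSED_LIMIT = 3;            // drop after 3 missed (from the slide)
    static final long MAX_SILENCE_MILLIS = MISSED_LIMIT * HEARTBEAT_MILLIS;

    private final Map<String, Long> lastSeen = new HashMap<>();

    /** "here I am": record a heartbeat for an instance. */
    synchronized void hereIAm(String instanceId, long nowMillis) {
        lastSeen.put(instanceId, nowMillis);
    }

    /** "where is everyone": evict silent instances, then return the live ones, sorted. */
    synchronized List<String> whereIsEveryone(long nowMillis) {
        lastSeen.values().removeIf(seen -> nowMillis - seen > MAX_SILENCE_MILLIS);
        List<String> live = new ArrayList<>(lastSeen.keySet());
        Collections.sort(live);
        return live;
    }

    public static void main(String[] args) {
        HeartbeatRegistry registry = new HeartbeatRegistry();
        registry.hereIAm("api-1", 0L);
        registry.hereIAm("api-2", 60_000L);
        System.out.println(registry.whereIsEveryone(90_001L)); // api-1 has gone silent
    }
}
```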
  
Security Key Management

•  AKMS
    –  Dynamic Key Management interface
    –  Update AWS keys at runtime, no restart
    –  All keys stored securely, none on disk or in AMI
•  Cryptex - Flexible key store
    –  Low grade keys processed in client
    –  Medium grade keys processed by Cryptex service
    –  High grade keys processed by hardware (Ingrian)
  
AWS Persistence Services

•  SimpleDB
    –  Got us started, migrating to Cassandra now
    –  NFSDB - Instrumented wrapper library
    –  Domain and Item sharding (workarounds)
•  S3
    –  Upgraded/Instrumented JetS3t based interface
    –  Supports multipart upload and large files
    –  Global S3 endpoint management
  
Netflix Platform Persistence

•  Eccentric Volatile Cache – evcache
    –  Discovery-aware memcached based backend
    –  Client abstractions for zone aware replication
    –  Option to write to all zones, fast read from local
•  Cassandra
    –  Highly available and scalable (more later…)
•  MongoDB
    –  Complex object/query model for small scale use
•  MySQL
    –  Hard to scale, legacy and small relational models
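
The evcache option to "write to all zones, fast read from local" can be sketched with plain maps standing in for the memcached replicas. This is an illustration only: the class, method and zone names are hypothetical, and the real client talks to memcached servers found through Discovery.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a zone-aware cache client; each zone's replica is a HashMap.
final class ZoneAwareCache {
    private final Map<String, Map<String, String>> replicasByZone = new LinkedHashMap<>();
    private final String localZone;

    ZoneAwareCache(String localZone, String... allZones) {
        for (String zone : allZones) {
            replicasByZone.put(zone, new HashMap<>());
        }
        if (!replicasByZone.containsKey(localZone)) {
            throw new IllegalArgumentException("local zone must be one of the zones");
        }
        this.localZone = localZone;
    }

    /** Writes go to the replica in every zone. */
    void set(String key, String value) {
        for (Map<String, String> replica : replicasByZone.values()) {
            replica.put(key, value);
        }
    }

    /** Reads touch only the local zone's replica; null means a cache miss. */
    String get(String key) {
        return replicasByZone.get(localZone).get(key);
    }

    /** Number of zones currently holding the key (for inspection only). */
    int replicaCount(String key) {
        int count = 0;
        for (Map<String, String> replica : replicasByZone.values()) {
            if (replica.containsKey(key)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        ZoneAwareCache cache = new ZoneAwareCache("zone-a", "zone-a", "zone-b", "zone-c");
        cache.set("movie:1", "title");
        System.out.println(cache.get("movie:1") + " held in " + cache.replicaCount("movie:1") + " zones");
    }
}
```

The trade-off is that every write pays for one update per zone so that reads never have to leave the local zone.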
  
Aside: Adrian’s Rant on CAP Theorem

•    Instances and Networks will fail
•    Network failure = Partition “P” is a given
•    Distributed Systems: two choices – CP or AP
•    “Vendor claims CA”
      –  Usually they mean available when instances fail
•  Master-Slave = Consistent when Partitioned
      –  You can’t write unless you can see the master
•  Quorum = Available when Partitioned
      –  Writes proceed, conflicts will be patched up later
  
Why Cassandra?

•  We value Availability over Consistency – AP
    –  Cassandra is a Java distributed systems toolkit
•  We have a building full of Java engineers
    –  Riak is in Erlang – a blessing and a curse…
•  We want FOSS + Support
    –  Voldemort doesn’t have a support model
•  Writes are intrinsically harder than reads
    –  Hbase is optimized for reads, Cassandra for writes
•  We tested Cassandra and it works for us
    –  Step by step into full production over the last year
  
Priam – Cassandra Automation
Coming soon to http://github.com/netflix

•    Netflix Platform Tomcat Code
•    Zero touch auto-configuration
•    State management for Cassandra JVM
•    Token allocation and assignment
•    Broken node auto-replacement
•    Full and incremental backup to S3
•    Restore sequencing from S3
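
For "token allocation and assignment", the usual goal is to space nodes evenly around the Cassandra ring. The sketch below assumes Cassandra's RandomPartitioner, whose tokens lie between 0 and 2^127, and gives node i of N the token i * 2^127 / N. It is an illustration with hypothetical names, not Priam's code, which is not shown in these slides.

```java
import java.math.BigInteger;

// Hypothetical helper: evenly spaced tokens for a RandomPartitioner ring.
final class TokenAllocator {
    // Upper bound of the RandomPartitioner token space (2^127).
    static final BigInteger RING_SIZE = BigInteger.ONE.shiftLeft(127);

    /** Token for node `nodeIndex` (0-based) in a ring of `nodeCount` nodes. */
    static BigInteger tokenFor(int nodeIndex, int nodeCount) {
        if (nodeCount <= 0 || nodeIndex < 0 || nodeIndex >= nodeCount) {
            throw new IllegalArgumentException("need 0 <= nodeIndex < nodeCount");
        }
        return RING_SIZE
                .multiply(BigInteger.valueOf(nodeIndex))
                .divide(BigInteger.valueOf(nodeCount));
    }

    public static void main(String[] args) {
        for (int i = 0; i < 4; i++) {
            System.out.println("node " + i + " -> " + tokenFor(i, 4));
        }
    }
}
```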
  
Astyanax
Coming soon to http://github.com/netflix

•  Cassandra java client
•  API abstraction on top of Thrift protocol
•  “Fixed” Connection Pool abstraction (vs. Hector)
      –    Round robin with Failover
      –    Retry-able operations not tied to a connection
      –    Discovery integration
      –    Host reconnect (fixed interval or exponential backoff)
      –    Token aware (in development) to save a network hop
•    Netflix style configuration (INFLibrary)
•    Batch mutation: set, put, delete, increment
•    Simplified use of serializers via method overloading (vs. Hector)
•    ConnectionPoolMonitor interface for counters and tracers
•    Composite Column Names replacing deprecated SuperColumns
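
The connection pool bullets (round robin with failover, operations not tied to a connection, exponential backoff on reconnect) can be sketched independently of Astyanax. The class below is a hypothetical illustration and does not use or reproduce the Astyanax API; hosts are plain strings and an operation is any function of a host.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical pool: picks hosts round robin and retries a failed operation on the next host.
final class RoundRobinFailover {
    private final List<String> hosts;
    private int next = 0;

    RoundRobinFailover(List<String> hosts) {
        if (hosts.isEmpty()) {
            throw new IllegalArgumentException("need at least one host");
        }
        this.hosts = new ArrayList<>(hosts);
    }

    /** Runs the operation against successive hosts until one succeeds or all have failed. */
    synchronized <T> T execute(Function<String, T> operation) {
        RuntimeException lastFailure = null;
        for (int attempt = 0; attempt < hosts.size(); attempt++) {
            String host = hosts.get(next);
            next = (next + 1) % hosts.size();
            try {
                return operation.apply(host); // the operation is not bound to one connection
            } catch (RuntimeException e) {
                lastFailure = e;              // fail over to the next host
            }
        }
        throw new IllegalStateException("all hosts failed", lastFailure);
    }

    /** Reconnect delay that doubles per attempt (0-based) and is capped at maxMillis. */
    static long backoffMillis(int attempt, long baseMillis, long maxMillis) {
        long delay = baseMillis << Math.min(attempt, 20);
        return Math.min(delay, maxMillis);
    }

    public static void main(String[] args) {
        List<String> hosts = new ArrayList<>();
        hosts.add("a");
        hosts.add("b");
        hosts.add("c");
        RoundRobinFailover pool = new RoundRobinFailover(hosts);
        String served = pool.execute(host -> {
            if (host.equals("a")) {
                throw new RuntimeException("host a is down");
            }
            return host;
        });
        System.out.println("served by " + served);
    }
}
```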
  
Initializing Astyanax
  
// Configuration either set in code or nfastyanax.properties
platform.ListOfComponentsToInit=LOGGING,APPINFO,DISCOVERY
netflix.environment=test
default.astyanax.readConsistency=CL_QUORUM
default.astyanax.writeConsistency=CL_QUORUM
MyCluster.MyKeyspace.astyanax.servers=127.0.0.1

// Must initialize platform for discovery to work
NFLibraryManager.initLibrary(PlatformManager.class, props, false, true);
NFLibraryManager.initLibrary(NFAstyanaxManager.class, props, true, false);

// Open a keyspace instance
Keyspace keyspace = KeyspaceFactory.openKeyspace("MyCluster", "MyKeyspace");
Astyanax Query Example
Paginate through all columns in a row

ColumnList<String> columns;
int pageSize = 10;
try {
    RowQuery<String, String> query = keyspace
        .prepareQuery(CF_STANDARD1)
        .getKey("A")
        .setIsPaginating()
        .withColumnRange(new RangeBuilder().setMaxSize(pageSize).build());

    // Each execute() returns the next page; an empty page ends the loop
    while (!(columns = query.execute().getResult()).isEmpty()) {
        for (Column<String> c : columns) {
            // process each column here
        }
    }
} catch (ConnectionException e) {
    // handle or log the connection failure
}
  	
  
	
  
Data Migration to Cassandra
  
Distributed Key-Value Stores

•  Cloud has many key-value data stores
    –  More complex to keep track of, do backups etc.
    –  Each store is much simpler to administer DBA
    –  Joins take place in java code
•  No schema to change, no scheduled downtime
•  Latency for typical queries
    –  Memcached is dominated by network latency <1ms
    –  Cassandra takes a few milliseconds
    –  SimpleDB replication and REST auth overheads >10ms
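
"Joins take place in java code" means the application looks up one key-value store, then uses the result as keys into another. A minimal sketch follows, with in-memory maps standing in for the two stores; the store contents, names and the customer/movie example are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical application-side join across two key-value stores.
final class AppSideJoin {

    /** Looks up a customer's movie ids in one store, then each title in a second store. */
    static List<String> titlesFor(String customerId,
                                  Map<String, List<String>> queueStore,
                                  Map<String, String> titleStore) {
        List<String> movieIds = queueStore.get(customerId);
        if (movieIds == null) {
            return Collections.emptyList();
        }
        List<String> titles = new ArrayList<>();
        for (String movieId : movieIds) {
            String title = titleStore.get(movieId);
            if (title != null) {
                titles.add(title); // skip ids with no matching title
            }
        }
        return titles;
    }

    public static void main(String[] args) {
        Map<String, List<String>> queues = new HashMap<>();
        queues.put("cust-1", Arrays.asList("m1", "m2", "m9"));
        Map<String, String> titles = new HashMap<>();
        titles.put("m1", "First Movie");
        titles.put("m2", "Second Movie");
        System.out.println(titlesFor("cust-1", queues, titles));
    }
}
```

With no foreign keys in the stores, the code has to decide what to do about dangling references; this sketch simply skips them.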
  
Multi-Regional Data Replication

•  IR Framework – Datacenter Item Replicator
    –  Built in 2009, first step to the cloud
    –  Oracle to SimpleDB or Cassandra via poll and push
    –  Return updates to Oracle via SQS message queue
•  SimpleDB or S3 to Cassandra
    –  Data migration tool for global Netflix
•  Global SimpleDB and S3 Replication
    –  Cross region async updates USA to Europe
  
Transitional Steps

•  Bidirectional Replication
   –  Oracle to SimpleDB
   –  Queued reverse path using SQS
   –  Backups remain in Datacenter via Oracle
•  New Cloud-Only Data Sources
   –  Cassandra based
   –  No replication to Datacenter
   –  Backups performed in the cloud
  
API
[Diagram: API running on AWS EC2. Components shown: Front End Load Balancer, Discovery Service, API Proxy, Load Balancer, API, Component Services, SQS, Cassandra, memcached, EC2 Internal Disks, S3, SimpleDB, and Replication to Oracle in the Netflix Data Center]
  
Cutting the Umbilical

•  Transition Oracle Data Sources to Cassandra
    –  Offload Datacenter Oracle hardware
    –  Free up capacity for growth of remaining services
•  Transition SimpleDB+Memcached to Cassandra
    –  Primary data sources that need backup
    –  Keep simplest small use cases for now
•  New challenges
    –  Backup, restore, archive, business continuity
    –  Business Intelligence integration
  
API
[Diagram: API running on AWS EC2. Components shown: Front End Load Balancer, Discovery Service, API Proxy, Load Balancer, API, Component Services, memcached, Cassandra, EC2 Internal Disks, Backup to S3, SimpleDB]
  
High Availability

•  Cassandra stores 3 local copies, 1 per zone
       –  Synchronous access, durable, highly available
       –  Read/Write One fastest, least consistent - ~1ms
       –  Read/Write Quorum 2 of 3, consistent - ~3ms
•  AWS Availability Zones
       –  Separate buildings
       –  Separate power etc.
       –  Fairly close together
	
  
Cassandra Write Data Flows
Single Region, Multiple Availability Zone

1.  Client Writes to any Cassandra Node
2.  Coordinator Node replicates to nodes and Zones
3.  Nodes return ack to coordinator
4.  Coordinator returns ack to client
5.  Data written to internal commit log disk

If a node goes offline, hinted handoff completes the write when the node comes back up.
Requests can choose to wait for one node, a quorum, or all nodes to ack the write.
SSTable disk writes and compactions occur asynchronously.

[Diagram: Clients and six Cassandra nodes with disks, two each in Zones A, B and C, with arrows numbered 1 to 5 matching the steps above]
  
Data Flows for Multi-Region Writes
Consistency Level = Local Quorum

1.  Client Writes to any Cassandra Node
2.  Coordinator node replicates to other nodes Zones and regions
3.  Local write acks returned to coordinator
4.  Client gets ack when 2 of 3 local nodes are committed
5.  Data written to internal commit log disks
6.  When data arrives, remote node replicates data

If a node or region goes offline, hinted handoff completes the write when the node comes back up.
Nightly global compare and repair jobs ensure everything stays consistent.

[Diagram: US and EU regions, each with Clients and Cassandra nodes with disks in Zones A, B and C, linked with 100+ms latency, with numbered arrows matching the steps]
  C	
  
                                                                                                                                                                                       •  Disks	
  
                                                                                                                                                                                       •  Zone	
  B	
   8	
  

7.  Ack	
  direct	
  to	
  source	
  region	
                                       Cassandra	
                                                                    Cassandra	
  

    coordinator	
  
                                                                                    •  Disks	
                                                                     •  Disks	
  
                                                                                    •  Zone	
  B	
                                                                 •  Zone	
  B	
  



8.  Remote	
  copies	
  wri=en	
  to	
  
    commit	
  log	
  disks	
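
Step 4 acknowledges the client once 2 of the 3 local replicas have committed, which is a majority (quorum) rule. A minimal Python sketch of that counting rule follows. The function names are hypothetical and this is not Cassandra's implementation, only the arithmetic behind "2 of 3".

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return replication_factor // 2 + 1


def can_ack_client(local_commits: int, replication_factor: int = 3) -> bool:
    """True once enough local replicas have committed the write."""
    return local_commits >= quorum(replication_factor)


if __name__ == "__main__":
    # With three local replicas the ack is released at the second commit.
    for commits in range(4):
        print(commits, can_ack_client(commits))
```

Because the remote region is updated asynchronously (steps 6-8), it plays no part in this count.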
  
Remote Copies
•  Cassandra duplicates across AWS regions
    –  Asynchronous write, replicates at destination
    –  Doesn’t directly affect local read/write latency
•  Global Coverage
    –  Business agility
    –  Follow AWS…
•  Local Access
    –  Better latency
    –  Fault Isolation
  
    	
  
Cassandra Backup
•  Full Backup
    –  Time based snapshot
    –  SSTable compress -> S3
•  Incremental
    –  SSTable write triggers compressed copy to S3

[Diagram: ring of Cassandra nodes copying backups to S3.]
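
The incremental backup path (an SSTable write triggers a compressed copy to S3) could be sketched as below. This is an illustration only: `backup_sstable`, the key layout and the `upload` callback are hypothetical, gzip is an assumed choice of compression, and a real deployment would pass an S3 client's upload call instead of the in-memory dictionary used here.

```python
import gzip
import os
import tempfile
from typing import Callable


def backup_sstable(path: str, node_id: str,
                   upload: Callable[[str, bytes], None]) -> str:
    """Compress one SSTable file and hand it to an upload callback.

    Returns the (hypothetical) object key used for the copy.
    """
    with open(path, "rb") as f:
        compressed = gzip.compress(f.read())
    key = f"{node_id}/{os.path.basename(path)}.gz"
    upload(key, compressed)
    return key


if __name__ == "__main__":
    store = {}  # stands in for an S3 bucket
    with tempfile.TemporaryDirectory() as tmp:
        sstable = os.path.join(tmp, "example-Data.db")
        with open(sstable, "wb") as f:
            f.write(b"row data " * 100)
        key = backup_sstable(sstable, "node-1", store.__setitem__)
    print(key, len(store[key]), "bytes compressed")
```

Keeping the upload behind a callback lets the compression step be tested without any cloud credentials.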
  
Cassandra Restore
•  Full Restore
    –  Replace previous data
•  New Ring from Backup
    –  New name, old data
•  Scripted
    –  Create new instances
    –  Parallel load - fast

[Diagram: ring of Cassandra nodes restored from the S3 backup.]
  
Cassandra Online Analytics
•  Brisk = Hadoop + Cassandra
    –  Use split Brisk ring
    –  Size each separately
•  Direct Access
    –  Keyspaces
    –  Hive/Pig/Map-Reduce
    –  HDFS as a keyspace
    –  Distributed namenode

[Diagram: ring of Cassandra and Brisk nodes, with S3 backup.]
  
Cassandra Archive
Appropriate level of paranoia needed…

•  Archive could be unreadable
    –  Restore S3 backups weekly from prod to test
•  Archive could be stolen
    –  PGP Encrypt archive
•  AWS East Region could have a problem
    –  Copy data to AWS West
•  Production AWS Account could have an issue
    –  Separate Archive account with no-delete S3 ACL
•  AWS S3 could have a global problem
    –  Create an extra copy on a different cloud vendor
  
Extending to Multi-Region
In production last week for UK/Eire support!

1.  Create cluster in EU
2.  Backup US cluster to S3
3.  Restore backup in EU
4.  Local repair EU cluster
5.  Global repair/join

Take a Boeing 737 on a domestic flight, upgrade it to a 747 by adding more engines and fly it to Europe without landing it on the way…

[Diagram: US and EU regions, each with clients and a ring of Cassandra nodes with local disks in Zones A, B and C, linked with 100+ms latency, plus S3; numbered arrows 1-5 mark the steps above.]
  
Tools and Automation
•  Developer and Build Tools
    –  Jira, Perforce, Eclipse, Jenkins, Ivy, Artifactory
    –  Builds, creates .war file, .rpm, bakes AMI and launches
•  Custom Netflix Application Console
    –  AWS Features at Enterprise Scale (hide the AWS security keys!)
    –  Auto Scaler Group is unit of deployment to production
•  Open Source + Support
    –  Apache, Tomcat, Cassandra, Hadoop, OpenJDK, CentOS
    –  Datastax support for Cassandra, AWS support for Hadoop via EMR
•  Monitoring Tools
    –  Datastax Opscenter for monitoring Cassandra
    –  AppDynamics – Developer focus for cloud http://appdynamics.com
  
Developer Migration
•  Detailed SQL to NoSQL Transition Advice
    –  Sid Anand - QConSF Nov 5th – Netflix’ Transition to High Availability Storage Systems
    –  Blog - http://practicalcloudcomputing.com/
    –  Download Paper PDF - http://bit.ly/bhOTLu
•  Mark Atwood, "Guide to NoSQL, redux"
    –  YouTube http://youtu.be/zAbFRiyT3LU
  
Cloud Operations

Cassandra Use Cases
Model Driven Architecture
Performance and Scalability
  
Cassandra Use Cases
•  Key by Customer – Cross-region clusters
    –  Many app specific Cassandra clusters, read-intensive
    –  Keys+Rows in memory using m2.4xl Instances
•  Key by Customer:Movie – e.g. Viewing History
    –  Growing fast, write intensive – m1.xl instances
    –  Keys cached in memory, one cluster per region
•  Large scale data logging – lots of writes
    –  Column data expires after time period
    –  Distributed counters, one cluster per region
  
Model Driven Architecture
•  Datacenter Practices
    –  Lots of unique hand-tweaked systems
    –  Hard to enforce patterns
•  Model Driven Cloud Architecture
    –  Perforce/Ivy/Jenkins based builds for everything
    –  Every production instance is a pre-baked AMI
    –  Every application is managed by an Autoscaler

Every change is a new AMI
  
Chaos Monkey
•  Make sure systems are resilient
    –  Allow any instance to fail without customer impact
•  Chaos Monkey hours
    –  Monday-Thursday 9am-3pm random instance kill
•  Application configuration option
    –  Apps now have to opt-out from Chaos Monkey
•  Computers (Datacenter or AWS) randomly die
    –  Fact of life, but too infrequent to test resiliency
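
A rough sketch of the rules listed above: kills only happen Monday-Thursday between 9am and 3pm, one random instance is chosen, and apps that have opted out are skipped. Everything here (function names, the shape of the instance map, the `terminate` callback) is hypothetical and is not Netflix's Chaos Monkey code.

```python
import random
from datetime import datetime
from typing import Callable, Dict, List, Optional


def in_chaos_hours(now: datetime) -> bool:
    """Monday-Thursday, from 9am up to (not including) 3pm."""
    return now.weekday() <= 3 and 9 <= now.hour < 15


def pick_victim(instances: Dict[str, List[str]], opted_out: set,
                rng: random.Random) -> Optional[str]:
    """Pick one random instance from apps that have not opted out."""
    candidates = [i for app, ids in sorted(instances.items())
                  if app not in opted_out for i in ids]
    return rng.choice(candidates) if candidates else None


def run_once(now: datetime, instances: Dict[str, List[str]], opted_out: set,
             terminate: Callable[[str], None],
             rng: Optional[random.Random] = None) -> Optional[str]:
    """Kill at most one instance if we are inside Chaos Monkey hours."""
    if not in_chaos_hours(now):
        return None
    victim = pick_victim(instances, opted_out, rng or random.Random())
    if victim is not None:
        terminate(victim)
    return victim


if __name__ == "__main__":
    fleet = {"api": ["i-1", "i-2"], "billing": ["i-3"]}
    killed = []
    # 24 October 2011 was a Monday, so 10am falls inside the window.
    run_once(datetime(2011, 10, 24, 10, 0), fleet, {"billing"}, killed.append)
    print("terminated:", killed)
```

Restricting kills to working hours means a failure surfaces while engineers are on hand to respond.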
  
AppDynamics Monitoring of Cassandra – Automatic Discovery
  
Scalability Testing
•  Cloud Based Testing – frictionless, elastic
    –  Create/destroy any sized cluster in minutes
    –  Many test scenarios run in parallel
•  Test Scenarios
    –  Internal app specific tests
    –  Simple “stress” tool provided with Cassandra
•  Scale test, keep making the cluster bigger
    –  Check that tooling and automation work…
    –  How many ten column row writes/sec can we do?

<DrEvil>ONE MILLION</DrEvil>
  
Scale-Up Linearity
Client Writes/s by node count – Replication Factor = 3

[Chart: client writes/s (y-axis, 0 to 1,200,000) against node count (x-axis, 0 to 350); labelled data points 174373, 366828, 537172 and 1099837.]
  
Per Node Activity

Per Node                48 Nodes       96 Nodes       144 Nodes      288 Nodes
Per Server Writes/s     10,900 w/s     11,460 w/s     11,900 w/s     11,456 w/s
Mean Server Latency     0.0117 ms      0.0134 ms      0.0148 ms      0.0139 ms
Mean CPU %Busy          74.4 %         75.4 %         72.5 %         81.5 %
Disk Read               5,600 KB/s     4,590 KB/s     4,060 KB/s     4,280 KB/s
Disk Write              12,800 KB/s    11,590 KB/s    10,380 KB/s    10,080 KB/s
Network Read            22,460 KB/s    23,610 KB/s    21,390 KB/s    23,640 KB/s
Network Write           18,600 KB/s    19,600 KB/s    17,810 KB/s    19,770 KB/s

Node specification – Xen Virtual Images, AWS US East, three zones
•  Cassandra 0.8.6, CentOS, SunJDK6
•  AWS EC2 m1 Extra Large – Standard price $0.68/Hour
•  15 GB RAM, 4 Cores, 1Gbit network
•  4 internal disks (total 1.6TB, striped together, md, XFS)
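
The Per Server Writes/s row appears to follow from the client write rate, the replication factor of 3 and the node count, since each client write lands on three servers. The sketch below applies that arithmetic to the client rates quoted in this deck. The relationship is an inference from the numbers, not something the slides state.

```python
REPLICATION_FACTOR = 3

# node count -> client writes/s, as quoted in the deck
CLIENT_WRITES = {48: 174373, 96: 366828, 144: 537172, 288: 1099837}


def per_server_writes(nodes: int, client_writes: int,
                      rf: int = REPLICATION_FACTOR) -> float:
    """Writes/s each server handles when every write goes to rf replicas."""
    return client_writes * rf / nodes


if __name__ == "__main__":
    for nodes, writes in CLIENT_WRITES.items():
        rate = per_server_writes(nodes, writes)
        print(f"{nodes:>3} nodes: {rate:,.0f} w/s per server")
```

For 48, 96 and 288 nodes this gives roughly 10,898, 11,463 and 11,457 w/s, close to the table. The 144-node case comes out near 11,191 w/s rather than the 11,900 w/s shown.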
  
Time is Money

                         48 nodes       96 nodes       144 nodes      288 nodes
Writes Capacity          174373 w/s     366828 w/s     537172 w/s     1,099,837 w/s
Storage Capacity         12.8 TB        25.6 TB        38.4 TB        76.8 TB
Nodes Cost/hr            $32.64         $65.28         $97.92         $195.84
Test Driver Instances    10             20             30             60
Test Driver Cost/hr      $20.00         $40.00         $60.00         $120.00
Cross AZ Traffic         5 TB/hr        10 TB/hr       15 TB/hr       30 TB/hr [1]
Traffic Cost/10min       $8.33          $16.66         $25.00         $50.00
Setup Duration           15 minutes     22 minutes     31 minutes     66 minutes [2]
AWS Billed Duration      1 hr           1 hr           1 hr           2 hr
Total Test Cost          $60.97         $121.94        $182.92        $561.68

[1] Estimate two thirds of total network traffic
[2] Workaround for a tooling bug slowed setup
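
The cost rows can be reproduced with a few lines of arithmetic. The sketch below is one reading of the table and rests on assumptions: node cost is the node count times the $0.68/hour m1 Extra Large price, each test driver costs $2.00/hour (implied by $20.00 for 10 instances), and the 288-node total only matches if the Cassandra nodes alone are billed for the second hour. That last point is inferred from the numbers, not stated on the slide.

```python
from decimal import Decimal

NODE_PRICE = Decimal("0.68")    # m1 Extra Large, $/hour (from the node spec)
DRIVER_PRICE = Decimal("2.00")  # $/hour per test driver, implied by the table


def total_test_cost(nodes: int, drivers: int, traffic_cost: str,
                    node_hours: int = 1) -> Decimal:
    """Node cost for the billed hours + one hour of drivers + traffic."""
    node_cost = NODE_PRICE * nodes * node_hours
    driver_cost = DRIVER_PRICE * drivers
    return node_cost + driver_cost + Decimal(traffic_cost)


if __name__ == "__main__":
    # (nodes, drivers, traffic cost for the 10 minute run, billed node hours)
    runs = [(48, 10, "8.33", 1), (96, 20, "16.66", 1),
            (144, 30, "25.00", 1), (288, 60, "50.00", 2)]
    for nodes, drivers, traffic, hours in runs:
        print(nodes, "nodes:", total_test_cost(nodes, drivers, traffic, hours))
```

`Decimal` is used so the dollar amounts add up exactly to the cent instead of picking up binary floating point error.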
  
Takeaway

Netflix has built and deployed a scalable global Platform as a Service.

Also, benchmarking in the cloud is fast, cheap and scalable

http://www.linkedin.com/in/adriancockcroft
@adrianco #netflixcloud
acockcroft@netflix.com
  

More Related Content

What's hot

Apache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals ExplainedApache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals Explained
confluent
 
Getting Started with AWS Database Migration Service
Getting Started with AWS Database Migration ServiceGetting Started with AWS Database Migration Service
Getting Started with AWS Database Migration Service
Amazon Web Services
 
Kubernetes Introduction
Kubernetes IntroductionKubernetes Introduction
Kubernetes Introduction
Martin Danielsson
 
Autoscaling Kubernetes
Autoscaling KubernetesAutoscaling Kubernetes
Autoscaling Kubernetes
craigbox
 
Netflix Global Cloud Architecture
Netflix Global Cloud ArchitectureNetflix Global Cloud Architecture
Netflix Global Cloud Architecture
Adrian Cockcroft
 
Elastic Load Balancing Deep Dive - AWS Online Tech Talk
Elastic  Load Balancing Deep Dive - AWS Online Tech TalkElastic  Load Balancing Deep Dive - AWS Online Tech Talk
Elastic Load Balancing Deep Dive - AWS Online Tech Talk
Amazon Web Services
 
Amazon Timestream 시계열 데이터 전용 DB 소개 :: 변규현 - AWS Community Day 2019
Amazon Timestream 시계열 데이터 전용 DB 소개 :: 변규현 - AWS Community Day 2019Amazon Timestream 시계열 데이터 전용 DB 소개 :: 변규현 - AWS Community Day 2019
Amazon Timestream 시계열 데이터 전용 DB 소개 :: 변규현 - AWS Community Day 2019
AWSKRUG - AWS한국사용자모임
 
Cost Optimization on AWS
Cost Optimization on AWSCost Optimization on AWS
Cost Optimization on AWS
Amazon Web Services
 
Amazon EKS Deep Dive
Amazon EKS Deep DiveAmazon EKS Deep Dive
Amazon EKS Deep Dive
Andrzej Komarnicki
 
AWS Simple Storage Service (s3)
AWS Simple Storage Service (s3) AWS Simple Storage Service (s3)
AWS Simple Storage Service (s3)
zekeLabs Technologies
 
Deploy and Govern at Scale with AWS Control Tower
Deploy and Govern at Scale with AWS Control TowerDeploy and Govern at Scale with AWS Control Tower
Deploy and Govern at Scale with AWS Control Tower
Amazon Web Services
 
Building Data Lakes for Analytics on AWS
Building Data Lakes for Analytics on AWSBuilding Data Lakes for Analytics on AWS
Building Data Lakes for Analytics on AWS
Amazon Web Services
 
Cloud computing by Google Cloud Platform - Presentation
Cloud computing by Google Cloud Platform - PresentationCloud computing by Google Cloud Platform - Presentation
Cloud computing by Google Cloud Platform - Presentation
TinarivosoaAbaniaina
 
Intro to AWS: Amazon EC2 and Compute Services
Intro to AWS: Amazon EC2 and Compute ServicesIntro to AWS: Amazon EC2 and Compute Services
Intro to AWS: Amazon EC2 and Compute Services
Amazon Web Services
 
Introduction to kubernetes
Introduction to kubernetesIntroduction to kubernetes
Introduction to kubernetes
Rishabh Indoria
 
멀티 어카운트 환경의 보안과 가시성을 높이기 위한 전략 - AWS Summit Seoul 2017
멀티 어카운트 환경의 보안과 가시성을 높이기 위한 전략 - AWS Summit Seoul 2017멀티 어카운트 환경의 보안과 가시성을 높이기 위한 전략 - AWS Summit Seoul 2017
멀티 어카운트 환경의 보안과 가시성을 높이기 위한 전략 - AWS Summit Seoul 2017
Amazon Web Services Korea
 
Amazon Relational Database Service (Amazon RDS)
Amazon Relational Database Service (Amazon RDS)Amazon Relational Database Service (Amazon RDS)
Amazon Relational Database Service (Amazon RDS)
Amazon Web Services
 
DevOps on AWS
DevOps on AWSDevOps on AWS
DevOps on AWS
Amazon Web Services
 
Amazon OpenSearch Deep dive - 내부구조, 성능최적화 그리고 스케일링
Amazon OpenSearch Deep dive - 내부구조, 성능최적화 그리고 스케일링Amazon OpenSearch Deep dive - 내부구조, 성능최적화 그리고 스케일링
Amazon OpenSearch Deep dive - 내부구조, 성능최적화 그리고 스케일링
Amazon Web Services Korea
 
Elasticsearch in Netflix
Elasticsearch in NetflixElasticsearch in Netflix
Elasticsearch in Netflix
Danny Yuan
 

What's hot (20)

Apache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals ExplainedApache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals Explained
 
Getting Started with AWS Database Migration Service
Getting Started with AWS Database Migration ServiceGetting Started with AWS Database Migration Service
Getting Started with AWS Database Migration Service
 
Kubernetes Introduction
Kubernetes IntroductionKubernetes Introduction
Kubernetes Introduction
 
Autoscaling Kubernetes
Autoscaling KubernetesAutoscaling Kubernetes
Autoscaling Kubernetes
 
Netflix Global Cloud Architecture
Netflix Global Cloud ArchitectureNetflix Global Cloud Architecture
Netflix Global Cloud Architecture
 
Elastic Load Balancing Deep Dive - AWS Online Tech Talk
Elastic  Load Balancing Deep Dive - AWS Online Tech TalkElastic  Load Balancing Deep Dive - AWS Online Tech Talk
Elastic Load Balancing Deep Dive - AWS Online Tech Talk
 
Amazon Timestream 시계열 데이터 전용 DB 소개 :: 변규현 - AWS Community Day 2019
Amazon Timestream 시계열 데이터 전용 DB 소개 :: 변규현 - AWS Community Day 2019Amazon Timestream 시계열 데이터 전용 DB 소개 :: 변규현 - AWS Community Day 2019
Amazon Timestream 시계열 데이터 전용 DB 소개 :: 변규현 - AWS Community Day 2019
 
Cost Optimization on AWS
Cost Optimization on AWSCost Optimization on AWS
Cost Optimization on AWS
 
Amazon EKS Deep Dive
Amazon EKS Deep DiveAmazon EKS Deep Dive
Amazon EKS Deep Dive
 
AWS Simple Storage Service (s3)
AWS Simple Storage Service (s3) AWS Simple Storage Service (s3)
AWS Simple Storage Service (s3)
 
Deploy and Govern at Scale with AWS Control Tower
Deploy and Govern at Scale with AWS Control TowerDeploy and Govern at Scale with AWS Control Tower
Deploy and Govern at Scale with AWS Control Tower
 
Building Data Lakes for Analytics on AWS
Building Data Lakes for Analytics on AWSBuilding Data Lakes for Analytics on AWS
Building Data Lakes for Analytics on AWS
 
Cloud computing by Google Cloud Platform - Presentation
Cloud computing by Google Cloud Platform - PresentationCloud computing by Google Cloud Platform - Presentation
Cloud computing by Google Cloud Platform - Presentation
 
Intro to AWS: Amazon EC2 and Compute Services
Intro to AWS: Amazon EC2 and Compute ServicesIntro to AWS: Amazon EC2 and Compute Services
Intro to AWS: Amazon EC2 and Compute Services
 
Introduction to kubernetes
Introduction to kubernetesIntroduction to kubernetes
Introduction to kubernetes
 
멀티 어카운트 환경의 보안과 가시성을 높이기 위한 전략 - AWS Summit Seoul 2017
멀티 어카운트 환경의 보안과 가시성을 높이기 위한 전략 - AWS Summit Seoul 2017멀티 어카운트 환경의 보안과 가시성을 높이기 위한 전략 - AWS Summit Seoul 2017
멀티 어카운트 환경의 보안과 가시성을 높이기 위한 전략 - AWS Summit Seoul 2017
 
Amazon Relational Database Service (Amazon RDS)
Amazon Relational Database Service (Amazon RDS)Amazon Relational Database Service (Amazon RDS)
Amazon Relational Database Service (Amazon RDS)
 
DevOps on AWS
DevOps on AWSDevOps on AWS
DevOps on AWS
 
Amazon OpenSearch Deep dive - 내부구조, 성능최적화 그리고 스케일링
Amazon OpenSearch Deep dive - 내부구조, 성능최적화 그리고 스케일링Amazon OpenSearch Deep dive - 내부구조, 성능최적화 그리고 스케일링
Amazon OpenSearch Deep dive - 내부구조, 성능최적화 그리고 스케일링
 
Elasticsearch in Netflix
Elasticsearch in NetflixElasticsearch in Netflix
Elasticsearch in Netflix
 

Viewers also liked

Shopzilla - Performance By Design
Shopzilla - Performance By DesignShopzilla - Performance By Design
Shopzilla - Performance By Design
Tim Morrow
 
LinkedIn - A Professional Network built with Java Technologies and Agile Prac...
LinkedIn - A Professional Network built with Java Technologies and Agile Prac...LinkedIn - A Professional Network built with Java Technologies and Agile Prac...
LinkedIn - A Professional Network built with Java Technologies and Agile Prac...
LinkedIn
 
Scalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsScalability, Availability & Stability Patterns
Scalability, Availability & Stability Patterns
Jonas Bonér
 
LinkedIn Data Infrastructure (QCon London 2012)
LinkedIn Data Infrastructure (QCon London 2012)LinkedIn Data Infrastructure (QCon London 2012)
LinkedIn Data Infrastructure (QCon London 2012)
Sid Anand
 
facebook architecture for 600M users
facebook architecture for 600M usersfacebook architecture for 600M users
facebook architecture for 600M users
Jongyoon Choi
 
Big Data in Real-Time at Twitter
Big Data in Real-Time at TwitterBig Data in Real-Time at Twitter
Big Data in Real-Time at Twitter
nkallen
 

Viewers also liked (6)

Shopzilla - Performance By Design
Shopzilla - Performance By DesignShopzilla - Performance By Design
Shopzilla - Performance By Design
 
LinkedIn - A Professional Network built with Java Technologies and Agile Prac...
LinkedIn - A Professional Network built with Java Technologies and Agile Prac...LinkedIn - A Professional Network built with Java Technologies and Agile Prac...
LinkedIn - A Professional Network built with Java Technologies and Agile Prac...
 
Scalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsScalability, Availability & Stability Patterns
Scalability, Availability & Stability Patterns
 
LinkedIn Data Infrastructure (QCon London 2012)
LinkedIn Data Infrastructure (QCon London 2012)LinkedIn Data Infrastructure (QCon London 2012)
LinkedIn Data Infrastructure (QCon London 2012)
 
facebook architecture for 600M users
facebook architecture for 600M usersfacebook architecture for 600M users
facebook architecture for 600M users
 
Big Data in Real-Time at Twitter
Big Data in Real-Time at TwitterBig Data in Real-Time at Twitter
Big Data in Real-Time at Twitter
 

Similar to Global Netflix Platform

Netflix keynote-adrian-qcon
Netflix keynote-adrian-qconNetflix keynote-adrian-qcon
Netflix keynote-adrian-qcon
Yiwei Ma
 
Netflix in the Cloud at SV Forum
Netflix in the Cloud at SV ForumNetflix in the Cloud at SV Forum
Netflix in the Cloud at SV Forum
Adrian Cockcroft
 
Netflix on Cloud - combined slides for Dev and Ops
Netflix on Cloud - combined slides for Dev and OpsNetflix on Cloud - combined slides for Dev and Ops
Netflix on Cloud - combined slides for Dev and Ops
Adrian Cockcroft
 
Netflix in the cloud 2011
Netflix in the cloud 2011Netflix in the cloud 2011
Netflix in the cloud 2011
Adrian Cockcroft
 
[Jun AWS 201] Technical Workshop
[Jun AWS 201] Technical Workshop[Jun AWS 201] Technical Workshop
[Jun AWS 201] Technical Workshop
Amazon Web Services Korea
 
O'Reilly Webcast: Architecting Applications For The Cloud
O'Reilly Webcast: Architecting Applications For The CloudO'Reilly Webcast: Architecting Applications For The Cloud
O'Reilly Webcast: Architecting Applications For The Cloud
Global Netflix Platform

  • 1. The Global Netflix Platform
      A Large Scale Java oriented PaaS running on AWS
      October 24th, 2011
      Adrian Cockcroft
      @adrianco #netflixcloud
      http://www.linkedin.com/in/adriancockcroft
  • 2. Netflix Inc.
      With more than 20 million streaming members in the United States, Canada and Latin America, Netflix, Inc. is the world's leading Internet subscription service for enjoying movies and TV shows.
      International Expansion
      Netflix, Inc., the leading global Internet movie subscription service… announced it will expand to the United Kingdom and Ireland in early 2012.
      Source: http://ir.netflix.com
  • 3. The Global Netflix Platform
      Netflix Cloud Migration
      Netflix Platform Services and Interfaces
      Highly Available and Globally Distributed Data
      Scalability and Performance
  • 4. Why Use Public Cloud?
  • 7. Data Center
      Netflix could not build new datacenters fast enough
      Capacity growth is accelerating, unpredictable
      Product launch spikes - iPhone, Wii, PS3, XBox
  • 8. Out-Growing Data Center
      http://techblog.netflix.com/2011/02/redesigning-netflix-api.html
      [Chart labels: 37x Growth Jan 2010-Jan 2011; Datacenter Capacity]
  • 9. Netflix.com is now ~100% Cloud
      A few small back end data sources still in progress
      All international product is cloud based
      USA specific logistics remains in the Datacenter
      Working aggressively on billing, PCI compliance on AWS
  • 10. Netflix Choice was AWS with our own platform and tools
      Unique platform requirements and extreme scale, agility and flexibility
  • 11. Leverage AWS Scale
      “the biggest public cloud”
      AWS investment in features and automation
      Use AWS zones and regions for high availability, scalability and global deployment
  • 12. But isn’t Amazon a competitor?
      Many products that compete with Amazon run on AWS
      We are a “poster child” for the AWS Architecture
      Netflix is one of the biggest AWS customers
      Strategy – turn competitors into partners
  • 13. Could Netflix use another cloud?
      Would be nice, we use three interchangeable CDN Vendors
      But no-one else has the scale and features of AWS
      You have to be this tall to ride this ride…
      Maybe in 2-3 years?
  • 14. We want to use clouds, we don’t have time to build them
      Public cloud for agility and scale
      We use electricity too, but don’t want to build our own power station…
      AWS because they are big enough to allocate thousands of instances per hour when we need to
  • 15. Netflix Deployed on AWS
      [Diagram: six columns of services]
      Content: Video Masters, EC2, S3, CDNs
      Logs: S3, EMR Hadoop, Hive, Business Intelligence
      Play: DRM, CDN routing, Bookmarks, Logging
      WWW: Sign-Up, Search, Movie Choosing, Ratings
      API: Metadata, Device Config, TV Movie Choosing, Social Facebook
      CS: International CS lookup, Diagnostics & Actions, Customer Call Log, CS Analytics
  • 16. Amazon Cloud Terminology Reference
      See http://aws.amazon.com/
      This is not a full list of Amazon Web Service features
      • AWS – Amazon Web Services (common name for Amazon cloud)
      • AMI – Amazon Machine Image (archived boot disk, Linux, Windows etc. plus application code)
      • EC2 – Elastic Compute Cloud
        – Range of virtual machine types m1, m2, c1, cc, cg. Varying memory, CPU and disk configurations.
        – Instance – a running computer system. Ephemeral, when it is de-allocated nothing is kept.
        – Reserved Instances – pre-paid to reduce cost for long term usage
        – Availability Zone – datacenter with own power and cooling hosting cloud instances
        – Region – group of Availability Zones – US-East, US-West, EU-Eire, Asia-Singapore, Asia-Japan, US-Gov
      • ASG – Auto Scaling Group (instances booting from the same AMI)
      • S3 – Simple Storage Service (http access)
      • EBS – Elastic Block Storage (network disk filesystem can be mounted on an instance)
      • RDS – Relational Database Service (managed MySQL master and slaves)
      • SDB – Simple Data Base (hosted http based NoSQL data store)
      • SQS – Simple Queue Service (http based message queue)
      • SNS – Simple Notification Service (http and email based topics and messages)
      • EMR – Elastic Map Reduce (automatically managed Hadoop cluster)
      • ELB – Elastic Load Balancer
      • EIP – Elastic IP (stable IP address mapping assigned to instance or ELB)
      • VPC – Virtual Private Cloud (extension of enterprise datacenter network into cloud)
      • IAM – Identity and Access Management (fine grain role based security keys)
  • 17. Boot Camp
      • One day “Netflix Cloud Training” class
        – Has been run 5 times for 20-45 people each time
      • Half day of presentations
      • Half day hands-on
        – Create your own hello world app
        – Launch in AWS test account
        – Login to your cloud instances
        – Find monitoring data on your cloud instances
        – Connect to Cassandra and read/write data
  • 18. Netflix Built a PaaS!
      • Netflix Cloud Systems team (50+ rock-stars :)
        – VP Cloud Systems (Yury Izrailevsky)
        – Site Reliability Engineering (@jedberg) Hiring++!
        – Cloud Performance (Denis Sheahan)
        – Database Engineering - Cassandra+MySQL (@r39132)
        – Platform Engineering – Astyanax (Eran Landau)
        – Cloud Tools Engineering – Jenkins (@cquinn)
        – Cloud Solutions Team – Monkeys (@atseitlin)
        – Security (Jason Chan)
        – Architecture (@adrianco)
  • 19. Netflix Global PaaS
      • Architecture Features and Overview
      • Portals and Explorers
      • Platform Services
      • Platform APIs
      • Platform Frameworks
      • Persistence
      • Scalability Benchmark
  • 20. Global PaaS?
      Toys are nice, but this is the real thing…
      • Supports all AWS Availability Zones and Regions
      • Supports multiple AWS accounts {test, prod, etc.}
      • Cross Region/Acct Data Replication and Archiving
      • Internationalized, Localized and GeoIP routing
      • Security is fine grain, dynamic AWS keys
      • Autoscaling to thousands of instances
      • Monitoring for millions of metrics
      • 20M+ users USA, Canada, Latin America (UK, Eire)
  • 21. Instance Architecture
      [Diagram: components on each instance]
      • Linux Base AMI (currently Centos 5)
      • Optional Apache frontend, memcached, non-java apps
      • Java (choice of JDK 6 or 7)
      • Tomcat
        – Application servlet, base server, platform, interface jars for dependent services
        – Healthcheck, status servlets, JMX interface
      • AppDynamics appagent monitoring
      • Log rotation to S3
      • GC and thread dump logging
      • Monitoring: AppDynamics machineagent, Epic
  • 22. Security Architecture
      • Instance Level Security baked into base AMI
        – Login via ssh only allowed via portal
        – Each app type runs as its own userid app{test|prod}
      • AWS Security, Identity and Access Management
        – Each app has its own security group (firewall ports)
        – Fine grain user roles and resource ACLs
      • Key Management
        – AWS Keys dynamically provisioned, easy updates
        – High grade app key management support
  • 23. Core Platform Frameworks and APIs
  • 24. Portals and Explorers
      • Netflix Application Console (NAC)
        – Primary AWS provisioning/config interface
      • AWS Usage Analyzer
        – Breaks down costs by application and resource
      • SimpleDB Explorer
        – Browse domains, items, attributes, values
      • Cassandra Explorer
        – Browse clusters, keyspaces, column families
      • Base Server Explorer
        – Browse service endpoints configuration, perf
  • 27. AWS Usage
      for test, carefully omitting any $ numbers…
  • 30. Platform Services
      • Discovery – service registry for “applications”
      • Introspection – Entrypoints
      • Cryptex – Dynamic security key management
      • Geo – Geographic IP lookup
      • Platformservice – Dynamic property configuration
      • Localization – manage and lookup local translations
      • Evcache – eccentric volatile (mem)cached
      • Cassandra – Persistence
      • Zookeeper - Coordination
      • Various proxies – access to old datacenter stuff
  • 31. Introspection - Entrypoints
      • REST API for tools, apps, explorers, monkeys…
        – E.g. GET /REST/v1/instance/$INSTANCE_ID
      • AWS Resources
        – Autoscaling Groups, EIP Groups, Instances
      • Netflix PaaS Resources
        – Discovery Applications, Clusters of ASGs, History
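The instance lookup above can be sketched as a small URL builder. This is only an illustration: the slide gives the path `/REST/v1/instance/$INSTANCE_ID` but not a host name or response format, so `ENTRYPOINTS_BASE` and the function name `instance_url` are hypothetical.

```python
from urllib.parse import quote, urljoin

# Hypothetical base URL; the slides do not name the Entrypoints host.
ENTRYPOINTS_BASE = "http://entrypoints.example.internal/"


def instance_url(instance_id, base=ENTRYPOINTS_BASE):
    """Build the URL for GET /REST/v1/instance/$INSTANCE_ID."""
    if not instance_id:
        raise ValueError("instance_id must be non-empty")
    # quote() escapes characters that are unsafe inside a single path segment.
    return urljoin(base, "REST/v1/instance/" + quote(instance_id, safe=""))


if __name__ == "__main__":
    print(instance_url("i-4e108521"))
```

A tool would then issue a plain HTTP GET against that URL; how the reply is encoded is not stated in the deck.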
  • 32. Entrypoints Queries
      MongoDB is good for low traffic complex queries against complex objects
      Description: Range expression
      – Find all active instances: all()
      – Find all instances associated with a group name: %(cloudmonkey)
      – Find all instances associated with a discovery group: /^cloudmonkey$/discovery()
      – Find all auto scale groups with no instances: asg(),-has(INSTANCES;asg())
      – How many instances are not in an auto scale group? count(all(),-info(eval(INSTANCES;asg())))
      – What groups include an instance? *(i-4e108521)
      – What auto scale groups and elastic load balancers include an instance? filter(TYPE;asg,elb;*(i-4e108521))
      – What instance has a given public ip? filter(PUBLIC_IP;174.129.188.{0..255};all())
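To make the last query concrete, here is a minimal sketch of one plausible reading of `filter(PUBLIC_IP;174.129.188.{0..255};all())`: the `{0..255}` part is treated as a numeric range that expands to 256 addresses, and instances are kept if their public IP is one of them. The real Entrypoints evaluator is not shown in the slides, so the function names and the dictionary-shaped instance records are assumptions.

```python
import re

# Matches a single numeric range such as {0..255}.
_RANGE = re.compile(r"\{(\d+)\.\.(\d+)\}")


def expand_range(pattern):
    """Expand one {lo..hi} range in a pattern into the list of concrete strings."""
    match = _RANGE.search(pattern)
    if match is None:
        return [pattern]
    lo, hi = int(match.group(1)), int(match.group(2))
    head, tail = pattern[:match.start()], pattern[match.end():]
    return [f"{head}{n}{tail}" for n in range(lo, hi + 1)]


def filter_by_public_ip(instances, pattern):
    """Keep instance records (hypothetical dicts) whose PUBLIC_IP matches the pattern."""
    wanted = set(expand_range(pattern))
    return [inst for inst in instances if inst.get("PUBLIC_IP") in wanted]


if __name__ == "__main__":
    sample = [
        {"ID": "i-1", "PUBLIC_IP": "174.129.188.7"},
        {"ID": "i-2", "PUBLIC_IP": "10.1.2.3"},
    ]
    print(filter_by_public_ip(sample, "174.129.188.{0..255}"))
```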
  • 33. Metrics Framework • System and Application – Collection, Aggregation, Querying and Reporting – Non-blocking logging, avoids log4j lock contention – Chukwa -> S3 -> EMR -> Hive • Performance, Robustness, Monitoring, Analysis – Tracers, Counters – explicit code instrumentation log – Real Time Tracers/Counters – SLA – service level response time percentiles – Epic (@MonitoredResources) annotated JMX extract • Latency Monkey Infrastructure – Inject random delays into service responses
  • 34. Configuration Management • NetflixConfiguration – Validation Framework – Sitewide Properties Explorer • PlatformService • Mapping Service • ZooKeeper (Curator) • InstanceIdentity
  • 35. Interprocess Communication • Discovery Service registry for “applications” – “here I am” call every 30s, drop after 3 missed – “where is everyone” call – Redundant, distributed, moving to Zookeeper • NIWS – Netflix Internal Web Service client – Software Middle Tier Load Balancer – Failure retry moves to next instance – Many options for encoding, etc.
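The "here I am" / "where is everyone" behaviour on this slide can be sketched as a small lease registry. This is only an illustration under stated assumptions: `LeaseRegistry` and its method names are hypothetical, not the Netflix Discovery API, and treating "3 missed" as 90 seconds of silence is one reading of the slide.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical heartbeat registry; not the Netflix Discovery implementation. */
final class LeaseRegistry {
    static final long HEARTBEAT_MS = 30_000; // "here I am" interval from the slide
    static final int MISSED_LIMIT = 3;       // drop after 3 missed heartbeats (assumed: 90s of silence)

    private final Map<String, Long> lastSeen = new HashMap<>();

    /** Record a "here I am" call from an instance at the given time. */
    public void heartbeat(String instanceId, long nowMs) {
        lastSeen.put(instanceId, nowMs);
    }

    /** Answer "where is everyone": drop expired instances, return the rest sorted by id. */
    public List<String> liveInstances(long nowMs) {
        lastSeen.values().removeIf(seen -> nowMs - seen >= HEARTBEAT_MS * MISSED_LIMIT);
        List<String> live = new ArrayList<>(lastSeen.keySet());
        Collections.sort(live);
        return live;
    }
}
```

Passing the clock in as a parameter keeps the sketch deterministic; a real registry would also need to be thread-safe and replicated, as the slide notes.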
  • 36. Security Key Management • AKMS – Dynamic Key Management interface – Update AWS keys at runtime, no restart – All keys stored securely, none on disk or in AMI • Cryptex - Flexible key store – Low grade keys processed in client – Medium grade keys processed by Cryptex service – High grade keys processed by hardware (Ingrian)
  • 37. AWS Persistence Services • SimpleDB – Got us started, migrating to Cassandra now – NFSDB - Instrumented wrapper library – Domain and Item sharding (workarounds) • S3 – Upgraded/Instrumented JetS3t based interface – Supports multipart upload and large files – Global S3 endpoint management
  • 38. Netflix Platform Persistence • Eccentric Volatile Cache – evcache – Discovery-aware memcached based backend – Client abstractions for zone aware replication – Option to write to all zones, fast read from local • Cassandra – Highly available and scalable (more later…) • MongoDB – Complex object/query model for small scale use • MySQL – Hard to scale, legacy and small relational models
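The evcache option "write to all zones, fast read from local" can be illustrated with in-memory maps standing in for the per-zone memcached pools. `ZoneAwareCache` and its methods are hypothetical names for this sketch and do not reflect the real evcache client.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical zone-aware cache client; plain maps stand in for memcached pools. */
final class ZoneAwareCache {
    private final Map<String, Map<String, String>> zones = new HashMap<>();
    private final String localZone;

    public ZoneAwareCache(List<String> zoneNames, String localZone) {
        for (String zone : zoneNames) {
            zones.put(zone, new HashMap<>());
        }
        if (!zones.containsKey(localZone)) {
            throw new IllegalArgumentException("unknown local zone: " + localZone);
        }
        this.localZone = localZone;
    }

    /** Write the value to every zone so any zone can serve later reads. */
    public void set(String key, String value) {
        for (Map<String, String> pool : zones.values()) {
            pool.put(key, value);
        }
    }

    /** Read only from the local zone, avoiding a cross-zone network hop. */
    public String get(String key) {
        return zones.get(localZone).get(key);
    }

    /** Read from a named zone; useful for checking that replication happened. */
    public String getFromZone(String zone, String key) {
        return zones.get(zone).get(key);
    }
}
```

The trade-off shown is that writes cost one call per zone while reads stay local.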
  • 39. Aside: Adrian’s Rant on CAP Theorem • Instances and Networks will fail • Network failure = Partition “P” is a given • Distributed Systems: two choices – CP or AP • “Vendor claims CA” – Usually they mean available when instances fail • Master-Slave = Consistent when Partitioned – You can’t write unless you can see the master • Quorum = Available when Partitioned – Writes proceed, conflicts will be patched up later
  • 40. Why Cassandra? • We value Availability over Consistency – AP – Cassandra is a Java distributed systems toolkit • We have a building full of Java engineers – Riak is in Erlang – a blessing and a curse… • We want FOSS + Support – Voldemort doesn’t have a support model • Writes are intrinsically harder than reads – Hbase is optimized for reads, Cassandra for writes • We tested Cassandra and it works for us – Step by step into full production over the last year
  • 41. Priam – Cassandra Automation   Coming soon to http://github.com/netflix • Netflix Platform Tomcat Code • Zero touch auto-configuration • State management for Cassandra JVM • Token allocation and assignment • Broken node auto-replacement • Full and incremental backup to S3 • Restore sequencing from S3
  • 42. Astyanax   Coming soon to http://github.com/netflix • Cassandra java client • API abstraction on top of Thrift protocol • “Fixed” Connection Pool abstraction (vs. Hector) – Round robin with Failover – Retry-able operations not tied to a connection – Discovery integration – Host reconnect (fixed interval or exponential backoff) – Token aware (in development) to save a network hop • Netflix style configuration (INFLibrary) • Batch mutation: set, put, delete, increment • Simplified use of serializers via method overloading (vs. Hector) • ConnectionPoolMonitor interface for counters and tracers • Composite Column Names replacing deprecated SuperColumns
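The connection pool ideas listed above (round robin with failover, operations not tied to a connection, reconnect with exponential backoff) can be sketched generically. `RoundRobinFailover` is a hypothetical class written for illustration; it is not Astyanax code and the backoff parameters are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical round-robin host selection with failover; not Astyanax code. */
final class RoundRobinFailover {
    private final List<String> hosts;
    private final AtomicInteger next = new AtomicInteger();

    public RoundRobinFailover(List<String> hosts) {
        if (hosts.isEmpty()) {
            throw new IllegalArgumentException("at least one host is required");
        }
        this.hosts = new ArrayList<>(hosts);
    }

    /** Try the operation on each host at most once, starting at the next host in rotation. */
    public <T> T execute(Function<String, T> operation) {
        int start = Math.floorMod(next.getAndIncrement(), hosts.size());
        RuntimeException last = null;
        for (int i = 0; i < hosts.size(); i++) {
            String host = hosts.get((start + i) % hosts.size());
            try {
                return operation.apply(host); // the operation is not tied to one connection
            } catch (RuntimeException e) {
                last = e; // fail over to the next host
            }
        }
        throw new IllegalStateException("all hosts failed", last);
    }

    /** Reconnect delay that doubles per attempt, capped at maxMs. */
    public static long backoffMs(int attempt, long baseMs, long maxMs) {
        int shift = Math.min(Math.max(attempt, 0), 20);
        return Math.min(baseMs << shift, maxMs);
    }
}
```

Because the operation is passed in as a function of the host, a retry simply re-runs it against the next host.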
  • 43. Initializing Astyanax
    // Configuration either set in code or nfastyanax.properties
    platform.ListOfComponentsToInit=LOGGING,APPINFO,DISCOVERY
    netflix.environment=test
    default.astyanax.readConsistency=CL_QUORUM
    default.astyanax.writeConsistency=CL_QUORUM
    MyCluster.MyKeyspace.astyanax.servers=127.0.0.1
    // Must initialize platform for discovery to work
    NFLibraryManager.initLibrary(PlatformManager.class, props, false, true);
    NFLibraryManager.initLibrary(NFAstyanaxManager.class, props, true, false);
    // Open a keyspace instance
    Keyspace keyspace = KeyspaceFactory.openKeyspace("MyCluster", "MyKeyspace");
  • 44. Astyanax Query Example   Paginate through all columns in a row
    ColumnList<String> columns;
    int pageSize = 10;
    try {
        RowQuery<String, String> query = keyspace
            .prepareQuery(CF_STANDARD1)
            .getKey("A")
            .setIsPaginating()
            .withColumnRange(new RangeBuilder().setMaxSize(pageSize).build());
        // Each execute() returns the next page; an empty page ends the loop
        while (!(columns = query.execute().getResult()).isEmpty()) {
            for (Column<String> c : columns) {
                // process column c
            }
        }
    } catch (ConnectionException e) {
        // handle or log the connection failure
    }
  • 45. Data Migration to Cassandra
  • 46. Distributed Key-Value Stores • Cloud has many key-value data stores – More complex to keep track of, do backups etc. – Each store is much simpler to administer   DBA – Joins take place in java code • No schema to change, no scheduled downtime • Latency for typical queries – Memcached is dominated by network latency <1ms – Cassandra takes a few milliseconds – SimpleDB replication and REST auth overheads >10ms
  • 47. Multi-Regional Data Replication • IR Framework – Datacenter Item Replicator – Built in 2009, first step to the cloud – Oracle to SimpleDB or Cassandra via poll and push – Return updates to Oracle via SQS message queue • SimpleDB or S3 to Cassandra – Data migration tool for global Netflix • Global SimpleDB and S3 Replication – Cross region async updates USA to Europe
  • 48. Transitional Steps • Bidirectional Replication – Oracle to SimpleDB – Queued reverse path using SQS – Backups remain in Datacenter via Oracle • New Cloud-Only Data Sources – Cassandra based – No replication to Datacenter – Backups performed in the cloud
  • 49. [Architecture diagram; labels: API, AWS EC2, Front End Load Balancer, Discovery Service, API Proxy, Load Balancer, Component Services, SQS, Oracle, Cassandra, memcached, Replication, EC2 Internal Disks, S3, SimpleDB, Netflix Data Center]
  • 50. Cutting the Umbilical • Transition Oracle Data Sources to Cassandra – Offload Datacenter Oracle hardware – Free up capacity for growth of remaining services • Transition SimpleDB+Memcached to Cassandra – Primary data sources that need backup – Keep simplest small use cases for now • New challenges – Backup, restore, archive, business continuity – Business Intelligence integration
  • 51. [Architecture diagram; labels: API, AWS EC2, Front End Load Balancer, Discovery Service, API Proxy, Load Balancer, Component Services, memcached, Cassandra, EC2 Internal Disks, Backup, S3, SimpleDB]
  • 52. High Availability • Cassandra stores 3 local copies, 1 per zone – Synchronous access, durable, highly available – Read/Write One fastest, least consistent - ~1ms – Read/Write Quorum 2 of 3, consistent - ~3ms • AWS Availability Zones – Separate buildings – Separate power etc. – Fairly close together
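The "Quorum 2 of 3" figure follows from majority arithmetic, and the contrast with "One" follows from the standard overlap rule R + W > N. A minimal sketch, with `QuorumMath` as a hypothetical helper name:

```java
/** Hypothetical helper illustrating quorum arithmetic for N replicas. */
final class QuorumMath {
    /** Smallest majority of the given replica count, e.g. 2 of 3. */
    public static int quorum(int replicas) {
        return replicas / 2 + 1;
    }

    /** True when every read set must overlap every write set (R + W > N). */
    public static boolean readsSeeLatestWrite(int replicas, int readAcks, int writeAcks) {
        return readAcks + writeAcks > replicas;
    }

    public static void main(String[] args) {
        int n = 3; // three local copies, one per zone, as on the slide
        System.out.println("quorum of " + n + " = " + quorum(n));
        System.out.println("quorum read + quorum write overlap: "
                + readsSeeLatestWrite(n, quorum(n), quorum(n)));
        System.out.println("read ONE + write ONE overlap: " + readsSeeLatestWrite(n, 1, 1));
    }
}
```

With three replicas, quorum reads and writes always share at least one node, while reading and writing at level One gives no such guarantee, which matches the slide's "least consistent" label.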
  • 53. Cassandra Write Data Flows   Single Region, Multiple Availability Zone
    1. Client Writes to any Cassandra Node
    2. Coordinator Node replicates to nodes and Zones
    3. Nodes return ack to coordinator
    4. Coordinator returns ack to client
    5. Data written to internal commit log disk
    If a node goes offline, hinted handoff completes the write when the node comes back up. Requests can choose to wait for one node, a quorum, or all nodes to ack the write. SSTable disk writes and compactions occur asynchronously.
    [Diagram: clients and six Cassandra nodes with disks across Zones A, B and C, annotated with step numbers 1 to 5]
  • 54. Data Flows for Multi-Region Writes   Consistency Level = Local Quorum
    1. Client Writes to any Cassandra Node
    2. Coordinator node replicates to other nodes, Zones and regions
    3. Local write acks returned to coordinator
    4. Client gets ack when 2 of 3 local nodes are committed
    5. Data written to internal commit log disks
    6. When data arrives, remote node replicates data
    7. Ack direct to source region coordinator
    8. Remote copies written to commit log disks
    If a node or region goes offline, hinted handoff completes the write when the node comes back up. Nightly global compare and repair jobs ensure everything stays consistent.
    [Diagram: US and EU regions, each with clients and Cassandra nodes with disks across Zones A, B and C, linked with 100+ms latency and annotated with step numbers 1 to 8]
  • 55. Remote Copies • Cassandra duplicates across AWS regions – Asynchronous write, replicates at destination – Doesn’t directly affect local read/write latency • Global Coverage – Business agility – Follow AWS… • Local Access – Better latency – Fault Isolation
  • 56. Cassandra Backup • Full Backup – Time based snapshot – SSTable compress -> S3 • Incremental – SSTable write triggers compressed copy to S3   [Diagram: ring of Cassandra nodes writing to S3 Backup]
  • 57. Cassandra Restore • Full Restore – Replace previous data • New Ring from Backup – New name old data • Scripted – Create new instances – Parallel load - fast   [Diagram: ring of Cassandra nodes loading from S3 Backup]
  • 58. Cassandra Online Analytics • Brisk = Hadoop + Cass – Use split Brisk ring – Size each separately • Direct Access – Keyspaces – Hive/Pig/Map-Reduce – Hdfs as a keyspace – Distributed namenode   [Diagram: ring of Cassandra and Brisk nodes with S3 Backup]
  • 59. Cassandra Archive   Appropriate level of paranoia needed… • Archive could be un-readable – Restore S3 backups weekly from prod to test • Archive could be stolen – PGP Encrypt archive • AWS East Region could have a problem – Copy data to AWS West • Production AWS Account could have an issue – Separate Archive account with no-delete S3 ACL • AWS S3 could have a global problem – Create an extra copy on a different cloud vendor
  • 60. Extending to Multi-Region   In production last week for UK/Eire support!
    1. Create cluster in EU
    2. Backup US cluster to S3
    3. Restore backup in EU
    4. Local repair EU cluster
    5. Global repair/join
    Take a Boeing 737 on a domestic flight, upgrade it to a 747 by adding more engines and fly it to Europe without landing it on the way…
    [Diagram: US and EU Cassandra rings, each with clients and nodes with disks across Zones A, B and C, linked with 100+ms latency; S3 holds the backup; annotated with step numbers 1 to 5]
  • 61. Tools and Automation • Developer and Build Tools – Jira, Perforce, Eclipse, Jenkins, Ivy, Artifactory – Builds, creates .war file, .rpm, bakes AMI and launches • Custom Netflix Application Console – AWS Features at Enterprise Scale (hide the AWS security keys!) – Auto Scaler Group is unit of deployment to production • Open Source + Support – Apache, Tomcat, Cassandra, Hadoop, OpenJDK, CentOS – Datastax support for Cassandra, AWS support for Hadoop via EMR • Monitoring Tools – Datastax Opscenter for monitoring Cassandra – AppDynamics – Developer focus for cloud http://appdynamics.com
  • 62. Developer Migration • Detailed SQL to NoSQL Transition Advice – Sid Anand - QConSF Nov 5th – Netflix’ Transition to High Availability Storage Systems – Blog - http://practicalcloudcomputing.com/ – Download Paper PDF - http://bit.ly/bhOTLu • Mark Atwood, "Guide to NoSQL, redux" – YouTube http://youtu.be/zAbFRiyT3LU
  • 63. Cloud Operations   Cassandra Use Cases   Model Driven Architecture   Performance and Scalability
  • 64. Cassandra Use Cases • Key by Customer – Cross-region clusters – Many app specific Cassandra clusters, read-intensive – Keys+Rows in memory using m2.4xl Instances • Key by Customer:Movie – e.g. Viewing History – Growing fast, write intensive – m1.xl instances – Keys cached in memory, one cluster per region • Large scale data logging – lots of writes – Column data expires after time period – Distributed counters, one cluster per region
  • 65. Model Driven Architecture • Datacenter Practices – Lots of unique hand-tweaked systems – Hard to enforce patterns • Model Driven Cloud Architecture – Perforce/Ivy/Jenkins based builds for everything – Every production instance is a pre-baked AMI – Every application is managed by an Autoscaler   Every change is a new AMI
  • 66. Chaos Monkey • Make sure systems are resilient – Allow any instance to fail without customer impact • Chaos Monkey hours – Monday-Thursday 9am-3pm random instance kill • Application configuration option – Apps now have to opt-out from Chaos Monkey • Computers (Datacenter or AWS) randomly die – Fact of life, but too infrequent to test resiliency
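The "Chaos Monkey hours" rule (Monday to Thursday, 9am to 3pm) amounts to a simple schedule guard. The sketch below assumes the window is half-open, so 3pm itself is excluded, and `ChaosWindow` is a hypothetical name rather than anything from the Chaos Monkey code.

```java
import java.time.DayOfWeek;
import java.time.LocalDateTime;

/** Hypothetical schedule guard for random instance kills. */
final class ChaosWindow {
    /** True on Monday through Thursday from 09:00 up to, but not including, 15:00. */
    public static boolean isKillAllowed(LocalDateTime now) {
        int day = now.getDayOfWeek().getValue(); // ISO numbering: Monday = 1 ... Sunday = 7
        boolean weekdayOk = day >= DayOfWeek.MONDAY.getValue()
                && day <= DayOfWeek.THURSDAY.getValue();
        int hour = now.getHour();
        return weekdayOk && hour >= 9 && hour < 15;
    }
}
```

Restricting kills to working hours early in the week means engineers are on hand to respond, which fits the slide's goal of testing resilience without customer impact.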
  • 67. AppDynamics Monitoring of Cassandra – Automatic Discovery
  • 68. Scalability Testing • Cloud Based Testing – frictionless, elastic – Create/destroy any sized cluster in minutes – Many test scenarios run in parallel • Test Scenarios – Internal app specific tests – Simple “stress” tool provided with Cassandra • Scale test, keep making the cluster bigger – Check that tooling and automation works… – How many ten column row writes/sec can we do?
  • 70. Scale-Up Linearity   Client Writes/s by node count – Replication Factor = 3   [Chart: y-axis client writes/s from 0 to 1,200,000; x-axis node count from 0 to 350; plotted values 174373, 366828, 537172 and 1099837 writes/s]
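One way to check the linearity claim is to divide each plotted value by its node count. The node counts (48, 96, 144, 288) are taken from the "Time is Money" slide later in the deck; `ScaleLinearity` is a hypothetical name and the integer division simply truncates.

```java
/** Hypothetical check of scale-up linearity from the deck's published figures. */
final class ScaleLinearity {
    // Node counts from the "Time is Money" slide; client writes/s from this chart.
    static final int[] NODES = {48, 96, 144, 288};
    static final long[] CLIENT_WRITES_PER_SEC = {174_373L, 366_828L, 537_172L, 1_099_837L};

    /** Client writes per second per node for the i-th test, truncated to a whole number. */
    public static long writesPerNode(int i) {
        return CLIENT_WRITES_PER_SEC[i] / NODES[i];
    }

    public static void main(String[] args) {
        for (int i = 0; i < NODES.length; i++) {
            System.out.printf("%d nodes: %d client writes/s per node%n", NODES[i], writesPerNode(i));
        }
    }
}
```

The per-node figure stays between roughly 3,600 and 3,800 client writes/s across a sixfold increase in cluster size, which is what near-linear scaling looks like in these numbers.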
  • 73. Per Node Activity
    Per Node | 48 Nodes | 96 Nodes | 144 Nodes | 288 Nodes
    Per Server Writes/s | 10,900 w/s | 11,460 w/s | 11,900 w/s | 11,456 w/s
    Mean Server Latency | 0.0117 ms | 0.0134 ms | 0.0148 ms | 0.0139 ms
    Mean CPU %Busy | 74.4 % | 75.4 % | 72.5 % | 81.5 %
    Disk Read | 5,600 KB/s | 4,590 KB/s | 4,060 KB/s | 4,280 KB/s
    Disk Write | 12,800 KB/s | 11,590 KB/s | 10,380 KB/s | 10,080 KB/s
    Network Read | 22,460 KB/s | 23,610 KB/s | 21,390 KB/s | 23,640 KB/s
    Network Write | 18,600 KB/s | 19,600 KB/s | 17,810 KB/s | 19,770 KB/s
    Node specification – Xen Virtual Images, AWS US East, three zones • Cassandra 0.8.6, CentOS, SunJDK6 • AWS EC2 m1 Extra Large – Standard price $ 0.68/Hour • 15 GB RAM, 4 Cores, 1Gbit network • 4 internal disks (total 1.6TB, striped together, md, XFS)
  • 74. Time is Money
    Measure | 48 nodes | 96 nodes | 144 nodes | 288 nodes
    Writes Capacity | 174373 w/s | 366828 w/s | 537172 w/s | 1,099,837 w/s
    Storage Capacity | 12.8 TB | 25.6 TB | 38.4 TB | 76.8 TB
    Nodes Cost/hr | $32.64 | $65.28 | $97.92 | $195.84
    Test Driver Instances | 10 | 20 | 30 | 60
    Test Driver Cost/hr | $20.00 | $40.00 | $60.00 | $120.00
    Cross AZ Traffic | 5 TB/hr | 10 TB/hr | 15 TB/hr | 30 TB/hr [1]
    Traffic Cost/10min | $8.33 | $16.66 | $25.00 | $50.00
    Setup Duration | 15 minutes | 22 minutes | 31 minutes | 66 minutes [2]
    AWS Billed Duration | 1 hr | 1 hr | 1 hr | 2 hr
    Total Test Cost | $60.97 | $121.94 | $182.92 | $561.68
    [1] Estimate two thirds of total network traffic
    [2] Workaround for a tooling bug slowed setup
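The totals on this slide can be reproduced from its own rows. The sketch below works in whole cents to avoid floating point rounding; `TestCost` is a hypothetical name, the $0.68/hour node price comes from the "Per Node Activity" slide, and the observation that the 288-node total only matches when the test drivers are billed for a single hour is an inference from the figures, not something the deck states.

```java
/** Hypothetical reconstruction of the slide's cost arithmetic, in cents. */
final class TestCost {
    static final long NODE_CENTS_PER_HOUR = 68; // m1 Extra Large at $0.68/hour

    /** Hourly cost in cents of a Cassandra cluster with the given node count. */
    public static long nodeCentsPerHour(int nodes) {
        return nodes * NODE_CENTS_PER_HOUR;
    }

    /** Total in cents: nodes for the billed hours, plus driver and traffic cost as listed. */
    public static long totalCents(int nodes, int billedHours, long driverCents, long trafficCents) {
        return nodeCentsPerHour(nodes) * billedHours + driverCents + trafficCents;
    }

    public static void main(String[] args) {
        System.out.println("48 nodes:  " + totalCents(48, 1, 2_000, 833) + " cents");
        System.out.println("96 nodes:  " + totalCents(96, 1, 4_000, 1_666) + " cents");
        System.out.println("144 nodes: " + totalCents(144, 1, 6_000, 2_500) + " cents");
        // Assumption: drivers billed for one hour even though nodes were billed for two.
        System.out.println("288 nodes: " + totalCents(288, 2, 12_000, 5_000) + " cents");
    }
}
```

Under these assumptions the four results come out to the slide's totals of $60.97, $121.94, $182.92 and $561.68.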
  • 75. Takeaway   Netflix has built and deployed a scalable global Platform as a Service.   Also, benchmarking in the cloud is fast, cheap and scalable   http://www.linkedin.com/in/adriancockcroft   @adrianco #netflixcloud   acockcroft@netflix.com