Cloud Architecture at Netflix
How Netflix Built a Scalable Java-oriented PaaS Running on AWS

SVForum, March 27th, 2012
Adrian Cockcroft
@adrianco #netflixcloud
http://www.linkedin.com/in/adriancockcroft
  
Adrian Cockcroft
•  Director, Architecture for Cloud Systems, Netflix Inc.
      –  Previously Director for Personalization Platform

•  Distinguished Availability Engineer, eBay Inc. 2004-7
      –  Founding member of eBay Research Labs

•  Distinguished Engineer, Sun Microsystems Inc. 1988-2004
      –  2003-4 Chief Architect High Performance Technical Computing
      –  2001 Author: Capacity Planning for Web Services
      –  1999 Author: Resource Management
      –  1995 & 1998 Author: Sun Performance and Tuning
      –  1996 Japanese Edition of Sun Performance and Tuning
             •  SPARC & Solarisパフォーマンスチューニング (サンソフトプレスシリーズ)

•  More
      –  Twitter @adrianco – Blog http://perfcap.blogspot.com
      –  Presentations at http://www.slideshare.net/adrianco
  
Why Netflix, Why Cloud, Why AWS
Part 1 of 3
  
What kind of Cloud?
•  Software as a Service – SaaS
     –  Replaces in house applications
     –  Targets end users
•  Platform as a Service – PaaS
     –  Replaces in house operations functions
     –  Targets developers
•  Infrastructure as a Service – IaaS
     –  Replaces in house datacenter capacity
     –  Targets developers and ITops
  
What Netflix Did
•  Moved to SaaS
    –  Corporate IT – OneLogin, Workday, Box, Evernote…
    –  Tools – Pagerduty, AppDynamics, Elastic MapReduce
•  Built our own PaaS <- today's focus
    –  Customized to make our developers productive
    –  When we started, we had little choice
•  Moved incremental capacity to IaaS
    –  No new datacenter space since 2008 as we grew
    –  Moved our streaming apps to the cloud
  
Why Use Public Cloud?

Things We Don't Do

Better Business Agility

Data Center
Netflix could not build new datacenters fast enough
    Capacity growth is accelerating, unpredictable
    Product launch spikes - iPhone, Wii, PS3, Xbox
    International – Canada, Latin America, UK/Ireland
  
Netflix.com is now ~100% Cloud
    A few small back end data sources still in progress
    All international product is cloud based
    USA specific logistics remains in the Datacenter
    Working on SOX, PCI as scope starts to include AWS
  
Netflix Choice was AWS with our own platform and tools
    Unique platform requirements and extreme scale, agility and flexibility
Leverage AWS Scale “the biggest public cloud”
    AWS investment in features and automation
Use AWS zones and regions for high availability, scalability and global deployment
  
But isn't Amazon a competitor?
    Many products that compete with Amazon run on AWS
    We are a “poster child” for the AWS Architecture
    Netflix is one of the biggest AWS customers
    Co-opetition - competitors are also partners
  
Could Netflix use another cloud?
    Would be nice, we use three interchangeable CDN Vendors
    But no-one else has the scale and features of AWS
    You have to be this tall to ride this ride…
    Maybe in 2-3 years?
  
We want to use clouds, we don't have time to build them
    Public cloud for agility and scale
    We use electricity too, but don't want to build our own power station…
    AWS because they are big enough to allocate thousands of instances per hour when we need to
  
What about other PaaS?
•  CloudFoundry – Open Source by VMware
    –  Developer-friendly, easy to get started
    –  Missing scale and some enterprise features
•  Rightscale
    –  Widely used to abstract away from AWS
    –  Creates its own lock-in problem…
•  AWS is growing into this space
    –  We didn't want a vendor between us and AWS
    –  We wanted to build a thin PaaS, that gets thinner
  
Netflix Deployed on AWS

2009             2009                    2010           2010             2010                 2011
Content          Logs                    Play           WWW              API                  CS
---------------  ----------------------  -------------  ---------------  -------------------  ----------------------
Video Masters    S3                      DRM            Sign-Up          Metadata             International CS lookup
EC2              EMR Hadoop              CDN routing    Search           Device Config        Diagnostics & Actions
S3               Hive                    Bookmarks      Movie Choosing   TV Movie Choosing    Customer Call Log
CDNs             Business Intelligence   Logging        Ratings          Social Facebook      CS Analytics
  
Cloud Architecture Patterns
Where do we start?
  
Goals
•  Faster
     –  Lower latency than the equivalent datacenter web pages and API calls
     –  Measured as mean and 99th percentile
     –  For both first hit (e.g. home page) and in-session hits for the same user
•  Scalable
     –  Avoid needing any more datacenter capacity as subscriber count increases
     –  No central vertically scaled databases
     –  Leverage AWS elastic capacity effectively
•  Available
     –  Substantially higher robustness and availability than datacenter services
     –  Leverage multiple AWS availability zones
     –  No scheduled down time, no central database schema to change
•  Productive
     –  Optimize agility of a large development team with automation and tools
     –  Leave behind complex tangled datacenter code base (~8 year old architecture)
     –  Enforce clean layered interfaces and re-usable components
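
The "Faster" goal is measured as mean and 99th percentile. As a minimal sketch of why both numbers are tracked, the class below computes them over a set of latency samples. The class name, the nearest-rank percentile definition, and the sample data are illustrative assumptions, not taken from the Netflix platform.

```java
import java.util.Arrays;

// Hypothetical helper: summarizes latency samples (in milliseconds).
final class LatencyStats {
    /** Arithmetic mean of the samples. */
    static double mean(double[] samplesMs) {
        double sum = 0;
        for (double s : samplesMs) {
            sum += s;
        }
        return sum / samplesMs.length;
    }

    /** Nearest-rank percentile: the smallest sample with at least p% of samples at or below it. */
    static double percentile(double[] samplesMs, double p) {
        double[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // 98 fast requests and 2 slow ones: the mean hides the tail, the 99th percentile exposes it.
        double[] ms = new double[100];
        Arrays.fill(ms, 10.0);
        ms[98] = 500.0;
        ms[99] = 500.0;
        System.out.println("mean=" + mean(ms) + " p99=" + percentile(ms, 99));
    }
}
```

A handful of slow requests barely moves the mean but dominates the 99th percentile, which is why a latency goal stated only as a mean can be met while some users still see slow pages.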
  
Datacenter Anti-Patterns
What do we currently do in the datacenter that prevents us from meeting our goals?
  
                       	
  
Rewrite from Scratch
    Not everything is cloud specific
    Pay down technical debt
    Robust patterns
  
Netflix Datacenter vs. Cloud Arch

Datacenter                      Cloud
------------------------------  ----------------------------------
Central SQL Database            Distributed Key/Value NoSQL
Sticky In-Memory Session        Shared Memcached Session
Chatty Protocols                Latency Tolerant Protocols
Tangled Service Interfaces      Layered Service Interfaces
Instrumented Code               Instrumented Service Patterns
Fat Complex Objects             Lightweight Serializable Objects
Components as Jar Files         Components as Services
  
Software Architecture Patterns
•  Object Models
   –  Basic and derived types, facets, serializable
   –  Pass by reference within a service
   –  Pass by value between services
•  Computation and I/O Models
   –  Service Execution using Best Effort / Futures
   –  Common thread pool management
   –  Circuit breakers to manage and contain failures
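
To make "circuit breakers to manage and contain failures" concrete, here is a minimal sketch of the general pattern. It is not the Netflix platform's implementation: the class name, the consecutive-failure threshold, and the injectable clock are all assumptions chosen to keep the example small and testable.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical breaker: opens after N consecutive failures, allows a trial call after a cool-off.
final class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;
    private final LongSupplier clock;
    private int consecutiveFailures = 0;
    private long openedAt = -1; // -1 means the circuit is closed

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    /** True when closed, or when the cool-off has elapsed and a trial call may go through. */
    synchronized boolean allowRequest() {
        return openedAt < 0 || clock.getAsLong() - openedAt >= openMillis;
    }

    /** Runs the remote call unless the circuit is open; any failure is answered by the fallback. */
    <T> T call(Supplier<T> remoteCall, Supplier<T> fallback) {
        if (!allowRequest()) {
            return fallback.get(); // fail fast without touching the struggling dependency
        }
        try {
            T result = remoteCall.get();
            recordSuccess();
            return result;
        } catch (RuntimeException e) {
            recordFailure();
            return fallback.get();
        }
    }

    private synchronized void recordSuccess() {
        consecutiveFailures = 0;
        openedAt = -1;
    }

    private synchronized void recordFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAt = clock.getAsLong();
        }
    }

    public static void main(String[] args) {
        SimpleCircuitBreaker breaker = new SimpleCircuitBreaker(2, 1000L, System::currentTimeMillis);
        Supplier<String> failing = () -> { throw new IllegalStateException("service down"); };
        Supplier<String> fallback = () -> "cached default";
        System.out.println(breaker.call(failing, fallback));
    }
}
```

The point of the pattern is that once a dependency is known to be failing, callers stop queuing requests against it, so one slow service does not exhaust the shared thread pools of the services that depend on it.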
  
Model Driven Architecture
•  Traditional Datacenter Practices
   –  Lots of unique hand-tweaked systems
   –  Hard to enforce patterns
   –  Some use of Puppet to automate changes

•  Model Driven Cloud Architecture
   –  Perforce/Ivy/Jenkins based builds for everything
   –  Every production instance is a pre-baked AMI
   –  Every application is managed by an Autoscaler

Every change is a new AMI
  
Netflix PaaS Principles
•  Maximum Functionality
    –  Developer productivity and agility
•  Leverage as much of AWS as possible
    –  AWS is making huge investments in features/scale
•  Interfaces that isolate Apps from AWS
    –  Avoid lock-in to specific AWS API details
•  Portability is a long term goal
    –  Gets easier as other vendors catch up with AWS
  
Netflix Global PaaS
•    Architecture Features and Overview
•    Portals and Explorers
•    Platform Services
•    Platform APIs
•    Platform Frameworks
•    Persistence
•    Scalability Benchmark
  
Global PaaS?
Toys are nice, but this is the real thing…
•    Supports all AWS Availability Zones and Regions
•    Supports multiple AWS accounts {test, prod, etc.}
•    Cross Region/Acct Data Replication and Archiving
•    Internationalized, Localized and GeoIP routing
•    Security is fine grain, dynamic AWS keys
•    Autoscaling to thousands of instances
•    Monitoring for millions of metrics
•    Productive for 100s of developers on one product
•    23M+ users USA, Canada, Latin America, UK, Eire
  
Basic PaaS Entities
•  AWS Based Entities
    –  Instances and Machine Images, Elastic IP Addresses
    –  Security Groups, Load Balancers, Autoscale Groups
    –  Availability Zones and Geographic Regions

•  Netflix PaaS Entities
    –  Applications (registered services)
    –  Clusters (versioned Autoscale Groups for an App)
    –  Properties (dynamic hierarchical configuration)
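
"Dynamic hierarchical configuration" can be read as: a property may be set at several scopes, values can change at runtime, and the most specific scope wins. The sketch below illustrates that reading only. The class, the scope names, and the property name are hypothetical; the slides do not describe how the Netflix platform resolves properties.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical store: scope name -> (property name -> value), updatable while the app runs.
final class HierarchicalProperties {
    private final Map<String, Map<String, String>> byScope = new ConcurrentHashMap<>();

    void set(String scope, String name, String value) {
        byScope.computeIfAbsent(scope, s -> new ConcurrentHashMap<>()).put(name, value);
    }

    /** Walks the scopes from most to least specific and returns the first value found. */
    Optional<String> get(List<String> scopesMostSpecificFirst, String name) {
        for (String scope : scopesMostSpecificFirst) {
            Map<String, String> props = byScope.get(scope);
            String value = (props == null) ? null : props.get(name);
            if (value != null) {
                return Optional.of(value);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        HierarchicalProperties props = new HierarchicalProperties();
        props.set("global", "client.timeoutMs", "2000");
        props.set("app:api", "client.timeoutMs", "500");
        // An instance in cluster api-v42 of app "api" resolves cluster, then app, then global.
        List<String> scopes = Arrays.asList("cluster:api-v42", "app:api", "global");
        System.out.println(props.get(scopes, "client.timeoutMs").orElse("unset"));
    }
}
```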
  
Core PaaS Services
•  AWS Based Services
    –  S3 storage, to 5TB files, parallel multipart writes
    –  SQS – Simple Queue Service. Messaging layer.

•  Netflix Based Services
    –  EVCache – memcached based ephemeral cache
    –  Cassandra – distributed data store

•  External Services
    –  GeoIP Lookup interfaced to a vendor
    –  Keystore HSM in Netflix Datacenter
  
Instance Architecture
Linux Base AMI (CentOS or Ubuntu)
    •  Optional Apache frontend, memcached, non-java apps
    •  Monitoring
        –  Log rotation to S3
        –  AppDynamics machineagent
        –  Epic
    •  Java (JDK 6 or 7)
        –  AppDynamics appagent
        –  GC and thread dump logging
    •  Tomcat
        –  Application servlet, base server, platform, interface jars for dependent services
        –  Healthcheck, status servlets, JMX interface, Servo autoscale
  	
  
Security Architecture
•  Instance Level Security baked into base AMI
    –  Login: ssh only allowed via portal (not between instances)
    –  Each app type runs as its own userid app{test|prod}

•  AWS Security, Identity and Access Management
    –  Each app has its own security group (firewall ports)
    –  Fine grain user roles and resource ACLs

•  Key Management
    –  AWS Keys dynamically provisioned, easy updates
    –  High grade app specific key management support
  
Portals and Explorers
•  Netflix Application Console (NAC)
   –  Primary AWS provisioning/config interface
•  AWS Usage Analyzer
   –  Breaks down costs by application and resource
•  Cassandra Explorer
   –  Browse clusters, keyspaces, column families
•  Base Server Explorer
   –  Browse service endpoints configuration, perf
  
Platform Services
•    Discovery – service registry for “Applications”
•    Introspection – Entrypoints
•    Cryptex – Dynamic security key management
•    Geo – Geographic IP lookup
•    Platformservice – Dynamic property configuration
•    Localization – manage and lookup local translations
•    Evcache – ephemeral volatile cache
•    Cassandra – Cross zone/region distributed data store
•    Zookeeper – Distributed Coordination (Curator)
•    Various proxies – access to old datacenter stuff
  
Metrics Framework
•  System and Application
    –  Collection, Aggregation, Querying and Reporting
    –  Non-blocking logging, avoids log4j lock contention
    –  Honu-Streaming -> S3 -> EMR -> Hive
•  Performance, Robustness, Monitoring, Analysis
    –  Tracers, Counters – explicit code instrumentation log
    –  Real Time Tracers/Counters
    –  SLA – service level response time percentiles
    –  Servo annotated JMX extract to Cloudwatch
•  Latency Monkey Infrastructure
    –  Inject random delays into service responses
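
The Latency Monkey bullet describes injecting random delays into service responses so that callers' timeouts and fallbacks are exercised. The following is a minimal sketch of that idea, not the Netflix tool itself: the class name, the per-call probability, and the delay bound are assumptions.

```java
import java.util.Random;
import java.util.function.Supplier;

// Hypothetical wrapper that delays a fraction of calls by a random amount.
final class LatencyInjector {
    private final Random random;
    private final double probability;   // fraction of calls to delay, 0.0 to 1.0
    private final long maxDelayMillis;  // upper bound for an injected delay, at least 1

    LatencyInjector(long seed, double probability, long maxDelayMillis) {
        if (maxDelayMillis < 1) {
            throw new IllegalArgumentException("maxDelayMillis must be >= 1");
        }
        this.random = new Random(seed);
        this.probability = probability;
        this.maxDelayMillis = maxDelayMillis;
    }

    /** Returns 0 for an undelayed call, otherwise a delay between 1 and maxDelayMillis. */
    synchronized long nextDelayMillis() {
        if (random.nextDouble() >= probability) {
            return 0;
        }
        return 1 + (long) (random.nextDouble() * maxDelayMillis);
    }

    /** Invokes the service, sleeping first when a delay is injected. */
    <T> T call(Supplier<T> service) {
        long delay = nextDelayMillis();
        if (delay > 0) {
            try {
                Thread.sleep(delay);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and answer immediately
            }
        }
        return service.get();
    }

    public static void main(String[] args) {
        LatencyInjector monkey = new LatencyInjector(42L, 0.1, 200L);
        Supplier<String> service = () -> "response";
        System.out.println(monkey.call(service));
    }
}
```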
  
Netflix Platform Persistence
•  Ephemeral Volatile Cache – evcache
   –  Discovery-aware memcached based backend
   –  Client abstractions for zone aware replication
   –  Option to write to all zones, fast read from local
•  Cassandra
   –  Highly available and scalable (more later…)
•  MongoDB
   –  Complex object/query model for small scale use
•  MySQL
   –  Hard to scale, legacy and small relational models
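
The evcache option "write to all zones, fast read from local" can be sketched with in-memory maps standing in for the per-zone memcached pools. This is an illustration of the replication idea only: the class, the zone names, and the `getFrom` helper are hypothetical and are not the EVCache client API.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical zone-aware cache client; each inner map stands in for one zone's memcached pool.
final class ZoneAwareCache {
    private final String localZone;
    private final Map<String, Map<String, String>> zones = new LinkedHashMap<>();

    ZoneAwareCache(String localZone, String... allZones) {
        for (String zone : allZones) {
            zones.put(zone, new HashMap<>());
        }
        if (!zones.containsKey(localZone)) {
            throw new IllegalArgumentException("local zone must be one of the configured zones");
        }
        this.localZone = localZone;
    }

    /** Writes go to every zone so that any zone can serve the entry afterwards. */
    void put(String key, String value) {
        for (Map<String, String> replica : zones.values()) {
            replica.put(key, value);
        }
    }

    /** Reads stay in the caller's own zone, avoiding a cross-zone network hop. */
    String get(String key) {
        return zones.get(localZone).get(key);
    }

    /** Demo helper: what a client located in another zone would read. */
    String getFrom(String zone, String key) {
        return zones.get(zone).get(key);
    }

    public static void main(String[] args) {
        ZoneAwareCache cache = new ZoneAwareCache("us-east-1a", "us-east-1a", "us-east-1b", "us-east-1c");
        cache.put("session:42", "cached-session");
        System.out.println(cache.get("session:42"));
    }
}
```

The trade-off is more write traffic in exchange for local read latency and a cache that stays warm in the surviving zones if one zone is lost.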
  
Priam – Cassandra Automation
Available at http://github.com/netflix

•    Netflix Platform Tomcat Code
•    Zero touch auto-configuration
•    State management for Cassandra JVM
•    Token allocation and assignment
•    Broken node auto-replacement
•    Full and incremental backup to S3
•    Restore sequencing from S3
•    Grow/Shrink Cassandra “ring”
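
Token allocation decides which slice of the ring each Cassandra node owns. A common way to balance a ring, shown here as a sketch, is to space tokens evenly across the partitioner's range. This assumes Cassandra's RandomPartitioner (tokens from 0 to 2^127) and one token per node; Priam's actual allocation scheme may differ, and the class and method names are hypothetical.

```java
import java.math.BigInteger;

// Hypothetical allocator: evenly spaced tokens over the RandomPartitioner range.
final class TokenAllocator {
    static final BigInteger RING_SIZE = BigInteger.valueOf(2).pow(127);

    /** Token for node number nodeIndex (0-based) in a ring of nodeCount nodes. */
    static BigInteger tokenFor(int nodeIndex, int nodeCount) {
        if (nodeCount <= 0 || nodeIndex < 0 || nodeIndex >= nodeCount) {
            throw new IllegalArgumentException("nodeIndex must be in [0, nodeCount)");
        }
        return RING_SIZE
                .multiply(BigInteger.valueOf(nodeIndex))
                .divide(BigInteger.valueOf(nodeCount));
    }

    public static void main(String[] args) {
        int nodes = 6;
        for (int i = 0; i < nodes; i++) {
            System.out.println("node " + i + " -> token " + tokenFor(i, nodes));
        }
    }
}
```

With fixed spacing like this, growing or shrinking the ring changes every node's ideal token, which is part of why automating token assignment alongside node replacement is useful.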
  
Astyanax
Available at http://github.com/netflix

•  Cassandra java client
•  API abstraction on top of Thrift protocol
•  “Fixed” Connection Pool abstraction (vs. Hector)
      –    Round robin with Failover
      –    Retry-able operations not tied to a connection
      –    Netflix PaaS Discovery service integration
      –    Host reconnect (fixed interval or exponential backoff)
      –    Token aware to save a network hop – lower latency
      –    Latency aware to avoid compacting/repairing nodes – lower variance
•    Batch mutation: set, put, delete, increment
•    Simplified use of serializers via method overloading (vs. Hector)
•    ConnectionPoolMonitor interface for counters and tracers
•    Composite Column Names replacing deprecated SuperColumns
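
Two of the connection pool behaviours listed above, round robin with failover and host reconnect with exponential backoff, can be sketched in a few lines. This is an illustration of the techniques, not Astyanax code: the class, its methods, and the host names are hypothetical, and a real pool would also track connections and host state.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical pool: rotates through hosts and retries a failed operation on the next host.
final class RoundRobinFailover {
    private final List<String> hosts;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinFailover(List<String> hosts) {
        this.hosts = hosts;
    }

    /** The operation is not tied to one connection: on failure it is retried on the next host. */
    <T> T execute(Function<String, T> operation) {
        RuntimeException last = null;
        for (int attempt = 0; attempt < hosts.size(); attempt++) {
            String host = hosts.get(Math.floorMod(next.getAndIncrement(), hosts.size()));
            try {
                return operation.apply(host);
            } catch (RuntimeException e) {
                last = e; // remember the failure and fail over
            }
        }
        throw new IllegalStateException("all hosts failed", last);
    }

    /** Reconnect delay that doubles with each failed attempt (attempt >= 0), capped at maxMillis. */
    static long backoffMillis(int attempt, long baseMillis, long maxMillis) {
        long delay = baseMillis << Math.min(attempt, 30);
        return Math.min(delay, maxMillis);
    }

    public static void main(String[] args) {
        RoundRobinFailover pool = new RoundRobinFailover(Arrays.asList("cass-a", "cass-b", "cass-c"));
        Function<String, String> op = host -> {
            if (host.equals("cass-a")) {
                throw new IllegalStateException(host + " is down");
            }
            return "served by " + host;
        };
        System.out.println(pool.execute(op));
        System.out.println(backoffMillis(3, 250L, 10_000L));
    }
}
```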
  
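Astyanax Query Example
Paginate through all columns in a row

import com.netflix.astyanax.connectionpool.exceptions.ConnectionException;
import com.netflix.astyanax.model.Column;
import com.netflix.astyanax.model.ColumnList;
import com.netflix.astyanax.query.RowQuery;
import com.netflix.astyanax.util.RangeBuilder;

// Assumes `keyspace` and the column family CF_STANDARD1 are already set up.
ColumnList<String> columns;
int pageSize = 10;
try {
    RowQuery<String, String> query = keyspace
        .prepareQuery(CF_STANDARD1)
        .getKey("A")
        .setIsPaginating()
        .withColumnRange(new RangeBuilder().setMaxSize(pageSize).build());

    // Each execute() fetches the next page; an empty page ends the loop.
    while (!(columns = query.execute().getResult()).isEmpty()) {
        for (Column<String> c : columns) {
            // process column c here
        }
    }
} catch (ConnectionException e) {
    // handle or rethrow the connection failure
}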
  	
  
	
  
High Availability
•  Cassandra stores 3 local copies, 1 per zone
       –  Synchronous access, durable, highly available
       –  Read/Write One fastest, least consistent - ~1ms
       –  Read/Write Quorum 2 of 3, consistent - ~3ms
•  AWS Availability Zones
       –  Separate buildings
       –  Separate power etc.
       –  Fairly close together
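
The consistency trade-off above follows from simple arithmetic: with N replicas, a quorum is floor(N/2) + 1, and a read is guaranteed to overlap the latest acknowledged write when the number of replicas read (R) plus the number written (W) exceeds N. A minimal sketch, with a hypothetical class name:

```java
// Hypothetical helper for replica-count arithmetic.
final class QuorumMath {
    /** Smallest majority of n replicas. */
    static int quorum(int replicas) {
        return replicas / 2 + 1;
    }

    /** True when every read set must intersect every write set, i.e. R + W > N. */
    static boolean readsSeeLatestWrite(int replicas, int readCount, int writeCount) {
        return readCount + writeCount > replicas;
    }

    public static void main(String[] args) {
        int n = 3; // 3 local copies, 1 per zone
        System.out.println("quorum of 3 = " + quorum(n));
        System.out.println("ONE/ONE overlaps: " + readsSeeLatestWrite(n, 1, 1));
        System.out.println("QUORUM/QUORUM overlaps: " + readsSeeLatestWrite(n, quorum(n), quorum(n)));
    }
}
```

With 3 copies, Read/Write One touches a single replica each way (1 + 1 is not greater than 3), so a read can miss a recent write, while Quorum reads and writes touch 2 of 3 (2 + 2 > 3) and always share at least one replica.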
  
	
  
“Traditional” Cassandra Write Data Flows
Single Region, Multiple Availability Zone, Not Token Aware

1.  Client writes to any Cassandra node
2.  Coordinator node replicates to nodes and zones
3.  Nodes return ack to coordinator
4.  Coordinator returns ack to client
5.  Data written to internal commit log disk (no more than 10 seconds later)

If a node goes offline, hinted handoff completes the write when the node comes back up.
Requests can choose to wait for one node, a quorum, or all nodes to ack the write.
SSTable disk writes and compactions occur asynchronously.

[Diagram: non token aware clients writing to a ring of six Cassandra nodes with local disks, two each in Zones A, B and C; arrows are numbered to match the steps above.]
  
Astyanax - Cassandra Write Data Flows
Single Region, Multiple Availability Zone, Token Aware

1.  Client writes to nodes and zones
2.  Nodes return ack to client
3.  Data written to internal commit log disks (no more than 10 seconds later)

If a node goes offline, hinted handoff completes the write when the node comes back up.
Requests can choose to wait for one node, a quorum, or all nodes to ack the write.
SSTable disk writes and compactions occur asynchronously.

[Diagram: token aware clients writing directly to the replica nodes in a ring of six Cassandra nodes with local disks, two each in Zones A, B and C; arrows are numbered to match the steps above.]
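
A token aware client saves a network hop because it can work out for itself which nodes hold a row and write to them directly. The sketch below shows one way to do that lookup. It assumes the RandomPartitioner convention (the absolute value of the key's MD5 digest as its token), one token per node, and replicas taken as the next nodes clockwise on the ring. The class and node names are hypothetical and this is not Astyanax's implementation.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical router: maps a row key to the nodes that own it on the token ring.
final class TokenAwareRouter {
    private final TreeMap<BigInteger, String> ring = new TreeMap<>();

    void addNode(BigInteger token, String node) {
        ring.put(token, node);
    }

    /** Token of a row key: absolute value of its MD5 digest (RandomPartitioner convention). */
    static BigInteger tokenOf(String rowKey) {
        try {
            byte[] md5 = MessageDigest.getInstance("MD5")
                    .digest(rowKey.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(md5).abs();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** First node whose token is >= the key token, then the following nodes clockwise. */
    List<String> replicasFor(BigInteger token, int replicationFactor) {
        List<String> replicas = new ArrayList<>();
        if (ring.isEmpty()) {
            return replicas;
        }
        Map.Entry<BigInteger, String> entry = ring.ceilingEntry(token);
        if (entry == null) {
            entry = ring.firstEntry(); // wrap around the ring
        }
        int wanted = Math.min(replicationFactor, ring.size());
        while (replicas.size() < wanted) {
            replicas.add(entry.getValue());
            entry = ring.higherEntry(entry.getKey());
            if (entry == null) {
                entry = ring.firstEntry();
            }
        }
        return replicas;
    }

    public static void main(String[] args) {
        TokenAwareRouter router = new TokenAwareRouter();
        BigInteger step = BigInteger.valueOf(2).pow(127).divide(BigInteger.valueOf(3));
        router.addNode(BigInteger.ZERO, "node-zone-a");
        router.addNode(step, "node-zone-b");
        router.addNode(step.multiply(BigInteger.valueOf(2)), "node-zone-c");
        System.out.println(router.replicasFor(tokenOf("A"), 3));
    }
}
```

In the non token aware flow, the client sends the write to an arbitrary coordinator, which then forwards it to the replicas; with a lookup like this the client contacts the replicas itself, which is where the shorter step list on the token aware slide comes from.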
  
Data Flows for Multi-Region Writes
Token Aware, Consistency Level = Local Quorum

1.  Client writes to local replicas
2.  Local write acks returned to client, which continues when 2 of 3 local nodes are committed
3.  Local coordinator writes to remote coordinator
4.  When data arrives, remote coordinator node acks and copies to other remote zones
5.  Remote nodes ack to local coordinator
6.  Data flushed to internal commit log disks (no more than 10 seconds later)

If a node or region goes offline, hinted handoff completes the write when the node comes back up.
Nightly global compare and repair jobs ensure everything stays consistent.

[Diagram: US and EU regions, each with its own clients and six Cassandra nodes with local disks across Zones A, B and C, linked with 100+ms latency; arrows are numbered to match the steps above.]
  
Cassandra Backup
•  Full Backup
    –  Time based snapshot
    –  SSTable compress -> S3
•  Incremental
    –  SSTable write triggers compressed copy to S3
•  Archive
    –  Copy cross region
[Diagram: ring of Cassandra nodes backing up to S3, with a cross-region archive copy (A)]
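
The incremental backup idea on this slide (an SSTable write triggers a compressed copy to S3) can be sketched in a few lines. This is a minimal Python illustration, not Netflix's tooling: a local directory stands in for the S3 bucket, `backup_new_sstables` is a hypothetical helper, and the `*-Data.db` file pattern is used only for illustration. It relies on SSTables being immutable once written, so each file only needs to be copied once.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def backup_new_sstables(data_dir: Path, backup_dir: Path) -> list[str]:
    """Gzip every SSTable data file that is not yet present in backup_dir.

    backup_dir is a local stand-in for an S3 bucket (assumption).
    Returns the names of the files copied on this pass.
    """
    backup_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for sstable in sorted(data_dir.glob("*-Data.db")):
        target = backup_dir / (sstable.name + ".gz")
        if target.exists():
            continue  # already backed up on an earlier pass
        with open(sstable, "rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        copied.append(sstable.name)
    return copied


def demo() -> tuple[list[str], list[str], bytes]:
    """Run two backup passes over a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        data, backup = Path(tmp) / "data", Path(tmp) / "s3"
        data.mkdir()
        (data / "users-1-Data.db").write_bytes(b"row1" * 100)
        first = backup_new_sstables(data, backup)
        (data / "users-2-Data.db").write_bytes(b"row2" * 100)
        second = backup_new_sstables(data, backup)
        restored = gzip.decompress((backup / "users-2-Data.db.gz").read_bytes())
        return first, second, restored
```

Each pass copies only the SSTables that appeared since the previous pass. A real deployment would replace the local copy with an S3 upload and would trigger on the write rather than poll.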
  
ETL for Cassandra
•  Data is de-normalized over many clusters!
•  Too many to restore from backups for ETL
•  Solution – read backup files using Hadoop
•  Aegisthus
    –  http://techblog.netflix.com/2012/02/aegisthus-bulk-data-pipeline-out-of.html
    –  High throughput raw SSTable processing
    –  Re-normalizes many clusters to a consistent view
    –  Extract, Transform, then Load into Teradata
  
Cassandra Archive
Appropriate level of paranoia needed…
•  Archive could be un-readable
    –  Restore S3 backups weekly from prod to test, and daily ETL
•  Archive could be stolen
    –  PGP Encrypt archive
•  AWS East Region could have a problem
    –  Copy data to AWS West
•  Production AWS Account could have an issue
    –  Separate Archive account with no-delete S3 ACL
•  AWS S3 could have a global problem
    –  Create an extra copy on a different cloud vendor…
  
Tools and Automation
•  Developer and Build Tools
    –  Jira, Perforce, Eclipse, Jenkins, Ivy, Artifactory
    –  Builds, creates .war file, .rpm, bakes AMI and launches
•  Custom Netflix Application Console
    –  AWS Features at Enterprise Scale (hide the AWS security keys!)
    –  Auto Scaler Group is unit of deployment to production
•  Open Source + Support
    –  Apache, Tomcat, Cassandra, Hadoop
    –  Datastax support for Cassandra, AWS support for Hadoop via EMR
•  Monitoring Tools
    –  Alert processing gateway into Pagerduty
    –  AppDynamics – Developer focus for cloud http://appdynamics.com
  
Open Source Strategy
•  Release PaaS Components git-by-git
    –  Source at github.com/netflix
    –  Intros and techniques at techblog.netflix.com
    –  Blog post or new code every week or so
•  Motivations
    –  Give back to Apache licensed OSS community
    –  Motivate, retain, hire top engineers
    –  Create a community that adds features and fixes
  
Current OSS Projects and Posts
Legend (color coding lost in extraction): Github / Techblog, Apache Project, Techblog Post
    Priam         Exhibitor    Servo
    Astyanax      Curator      Autoscaling scripts
    CassJMeter    Zookeeper    Honu
    Cassandra     EVCache      Circuit Breaker
    Aegisthus
  
Scalability Testing
•  Cloud Based Testing – frictionless, elastic
    –  Create/destroy any sized cluster in minutes
    –  Many test scenarios run in parallel
•  Test Scenarios
    –  Internal app specific tests
    –  Simple “stress” tool provided with Cassandra
•  Scale test, keep making the cluster bigger
    –  Check that tooling and automation works…
    –  How many ten column row writes/sec can we do?
<DrEvil>ONE MILLION</DrEvil>
  
Scale-Up Linearity
http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
[Chart: Client Writes/s by node count – Replication Factor = 3. Y-axis: 0 to 1200000 writes/s; x-axis: 0 to 350 nodes. Data labels, in order of increasing node count: 174373, 366828, 537172, 1099837]
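
The linearity claim can be checked with a little arithmetic on the chart's four data labels. The sketch below is a Python illustration under one loud assumption: the node counts 48, 96, 144 and 288 are not printed on the slide (its x-axis only runs from 0 to 350), so verify them against the linked post before relying on the result. The helper names are hypothetical.

```python
# Client writes/s, taken from the chart's data labels.
WRITES_PER_SEC = [174373, 366828, 537172, 1099837]
# ASSUMED node counts: not printed on the slide, check against the linked post.
NODE_COUNTS = [48, 96, 144, 288]


def per_node_throughput(writes, nodes):
    """Writes/s handled per node at each cluster size."""
    return [w / n for w, n in zip(writes, nodes)]


def scaling_efficiency(writes, nodes):
    """Throughput gain divided by node-count gain, relative to the smallest cluster.

    A value of 1.0 means perfectly linear scale-up.
    """
    base_w, base_n = writes[0], nodes[0]
    return [(w / base_w) / (n / base_n) for w, n in zip(writes, nodes)]


if __name__ == "__main__":
    rows = zip(NODE_COUNTS,
               per_node_throughput(WRITES_PER_SEC, NODE_COUNTS),
               scaling_efficiency(WRITES_PER_SEC, NODE_COUNTS))
    for n, per_node, eff in rows:
        print(f"{n:4d} nodes: {per_node:8.0f} writes/s per node, efficiency {eff:.2f}")
```

Under the assumed node counts, per-node throughput stays between about 3,600 and 3,850 writes/s at every cluster size, which is what linear scale-up looks like.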
  
Availability and Resilience
  
Chaos Monkey
•  Computers (Datacenter or AWS) randomly die
    –  Fact of life, but too infrequent to test resiliency
•  Test to make sure systems are resilient
    –  Allow any instance to fail without customer impact
•  Chaos Monkey hours
    –  Monday-Thursday 9am-3pm random instance kill
•  Application configuration option
    –  Apps now have to opt-out from Chaos Monkey
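
The schedule and opt-out rule on this slide can be sketched as follows. This is a hypothetical Python illustration, not the actual Chaos Monkey code: `in_chaos_hours`, `pick_victim`, and the shape of the `instances` mapping are all assumptions, and the sketch only selects an instance rather than terminating anything.

```python
import random
from datetime import datetime


def in_chaos_hours(now: datetime) -> bool:
    """True on Monday-Thursday between 9am (inclusive) and 3pm (exclusive)."""
    return now.weekday() <= 3 and 9 <= now.hour < 15


def pick_victim(instances, opted_out, now, rng=random):
    """Return one random instance id to kill, or None if nothing should die now.

    instances: mapping of instance id -> application name (hypothetical shape).
    opted_out: set of application names that opted out of Chaos Monkey.
    """
    if not in_chaos_hours(now):
        return None
    candidates = sorted(i for i, app in instances.items() if app not in opted_out)
    return rng.choice(candidates) if candidates else None
```

Limiting kills to working hours means engineers are at their desks when an instance dies, and making opt-out explicit keeps every application in scope by default.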
  
Responsibility and Experience
•  Make developers responsible for failures
    –  Then they learn and write code that doesn’t fail
•  Use Incident Reviews to find gaps to fix
    –  Make sure it’s not about finding “who to blame”
•  Keep timeouts short, fail fast
    –  Don’t let cascading timeouts stack up
•  Make configuration options dynamic
    –  You don’t want to push code to tweak an option
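
One way to make an option dynamic is to re-read it from an external source on each use, so that changing a value needs no code push. A minimal Python sketch, assuming a local JSON file as the source: `DynamicProperty`, the file name, and the `timeout_ms` key are all hypothetical, and this is not Netflix's configuration mechanism.

```python
import json
import tempfile
from pathlib import Path


class DynamicProperty:
    """Re-reads its value from a JSON file on every access (hypothetical helper)."""

    def __init__(self, path, key, default):
        self.path = Path(path)
        self.key = key
        self.default = default

    def get(self):
        try:
            data = json.loads(self.path.read_text())
        except (OSError, ValueError):
            return self.default  # missing or malformed file: fall back
        if not isinstance(data, dict):
            return self.default
        return data.get(self.key, self.default)


def demo():
    """Show the value changing without redeploying any code."""
    with tempfile.TemporaryDirectory() as tmp:
        cfg = Path(tmp) / "config.json"
        timeout = DynamicProperty(cfg, "timeout_ms", 500)
        before = timeout.get()  # file does not exist yet: default
        cfg.write_text(json.dumps({"timeout_ms": 200}))
        after = timeout.get()   # new value picked up on the next read
        return before, after
```

Falling back to the default on a missing or malformed file keeps a bad configuration write from taking the service down.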
  
Resilient Design – Circuit Breakers
http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
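
For readers unfamiliar with the pattern, here is a generic circuit breaker sketch in Python. It is not the implementation described in the linked post: the class name, the `max_failures` and `reset_after` parameters, and the consecutive-failure rule are assumptions chosen to keep the example short.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=5.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("failing fast; dependency marked unhealthy")
            # Reset window elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get an immediate error instead of waiting on a timeout, which is how the pattern keeps cascading timeouts from stacking up.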
  
PaaS Operational Model
•  Developers
    –  Provision and run their own code in production
    –  Take turns to be on call if it breaks (pagerduty)
    –  Configure autoscalers to handle capacity needs
•  DevOps and PaaS (aka NoOps)
    –  DevOps is used to build and run the PaaS
    –  PaaS constrains Dev to use automation instead
    –  PaaS puts more responsibility on Dev, with tools
  
What’s Left for Corp IT?
•  Corporate Security and Network Management
    –  Billing and remnants of streaming service back-ends in DC
•  Running Netflix’ DVD Business
    –  Tens of Oracle instances
    –  Hundreds of MySQL instances
    –  Thousands of VMWare VMs
    –  Zabbix, Cacti, Splunk, Puppet
•  Employee Productivity
    –  Building networks and WiFi
    –  SaaS OneLogin SSO Portal
    –  Evernote Premium, Safari Online Bookshelf, Dropbox for Teams
    –  Google Enterprise Apps, Workday HCM/Expense, Box.com
    –  Many more SaaS migrations coming…
[Image: Corp WiFi Performance]
  
Implications for IT Operations
•  Cloud is run by developer organization
    –  Product group’s “IT department” is the AWS API and PaaS
    –  CorpIT handles billing and some security functions
•  Cloud capacity is 10x bigger than Datacenter
    –  Datacenter oriented IT didn’t scale up as we grew
    –  We moved a few people out of IT to do DevOps for our PaaS
•  Traditional IT Roles and Silos are going away
    –  We don’t have SA, DBA, Storage, Network admins for cloud
    –  Developers deploy and “run what they wrote” in production
  
Netflix PaaS Organization
Developer Org Reporting into Product Development, not ITops

Netflix Cloud Platform Team
•  Cloud Ops Reliability Engineering
    –  Alert Routing, Incident Lifecycle
    –  PagerDuty
•  Architecture
    –  Future planning, Security Arch, Efficiency
    –  AWS VPC, Hyperguard, Powerpoint :)
•  Build Tools and Automation
    –  Perforce, Jenkins, Artifactory, JIRA
    –  Base AMI, Bakery, Netflix App Console
    –  AWS API
•  Platform and Database Engineering
    –  Platform jars, Key store, Zookeeper, Cassandra
    –  AWS Instances
•  Cloud Performance
    –  Cassandra Benchmarking, JVM GC Tuning, Wiresharking
    –  AWS Instances
•  Cloud Solutions
    –  Monitoring, Monkeys, Entrypoints
    –  AWS Instances
  
Roadmap for 2012
•  Readiness for global Netflix launches
•  More resiliency and improved availability
•  More automation, orchestration
•  “Hardening” the platform
•  Lower latency for web services and devices
•  Working towards IPv6 support
•  More open sourced components
  
Wrap Up

Answer your remaining questions…

What was missing that you wanted to cover?

Next up – Jason Chan on Security Architecture
  
Takeaway

Netflix has built and deployed a scalable global Platform as a Service.

Key components of the Netflix PaaS are being released as Open Source projects so you can build your own custom PaaS.

http://github.com/Netflix
http://techblog.netflix.com
http://slideshare.net/Netflix

http://www.linkedin.com/in/adriancockcroft
@adrianco #netflixcloud
  

More Related Content

What's hot

Azure Cloud Adoption Framework + Governance - Sana Khan and Jay Kumar
Azure Cloud Adoption Framework + Governance - Sana Khan and Jay Kumar Azure Cloud Adoption Framework + Governance - Sana Khan and Jay Kumar
Azure Cloud Adoption Framework + Governance - Sana Khan and Jay Kumar
Timothy McAliley
 
BDT201 AWS Data Pipeline - AWS re: Invent 2012
BDT201 AWS Data Pipeline - AWS re: Invent 2012BDT201 AWS Data Pipeline - AWS re: Invent 2012
BDT201 AWS Data Pipeline - AWS re: Invent 2012
Amazon Web Services
 
Cloud Streaming Platform @Generali Switzerland
Cloud Streaming Platform @Generali SwitzerlandCloud Streaming Platform @Generali Switzerland
Cloud Streaming Platform @Generali Switzerland
confluent
 
Building PCI Compliance Solution on AWS - Pop-up Loft Tel Aviv
Building PCI Compliance Solution on AWS - Pop-up Loft Tel AvivBuilding PCI Compliance Solution on AWS - Pop-up Loft Tel Aviv
Building PCI Compliance Solution on AWS - Pop-up Loft Tel Aviv
Amazon Web Services
 
Overview of API Management Architectures
Overview of API Management ArchitecturesOverview of API Management Architectures
Overview of API Management Architectures
Nordic APIs
 
Splunk-Presentation
Splunk-Presentation Splunk-Presentation
Splunk-Presentation
PrasadThorat23
 
Cloud Journey Roadmap: Capgemini's Cloud Readiness Assessment
Cloud Journey Roadmap: Capgemini's Cloud Readiness AssessmentCloud Journey Roadmap: Capgemini's Cloud Readiness Assessment
Cloud Journey Roadmap: Capgemini's Cloud Readiness Assessment
Capgemini
 
Cost Optimisation on AWS
Cost Optimisation on AWSCost Optimisation on AWS
Cost Optimisation on AWS
Amazon Web Services
 
Module 1 - AWSome Day Online Conference Thailand
Module 1 - AWSome Day Online Conference Thailand Module 1 - AWSome Day Online Conference Thailand
Module 1 - AWSome Day Online Conference Thailand
Amazon Web Services
 
Ansiblefest 2018 Network automation journey at roblox
Ansiblefest 2018 Network automation journey at robloxAnsiblefest 2018 Network automation journey at roblox
Ansiblefest 2018 Network automation journey at roblox
Damien Garros
 
Cloud Migration: A How-To Guide
Cloud Migration: A How-To GuideCloud Migration: A How-To Guide
Cloud Migration: A How-To Guide
Amazon Web Services
 
Agile Testing in the Cloud
Agile Testing in the CloudAgile Testing in the Cloud
Agile Testing in the Cloud
Cygnet Infotech
 
API Management - Why it matters!
API Management - Why it matters!API Management - Why it matters!
API Management - Why it matters!
Sven Bernhardt
 
Scaling Up to Your First 10 Million Users (ARC205-R1) - AWS re:Invent 2018
Scaling Up to Your First 10 Million Users (ARC205-R1) - AWS re:Invent 2018Scaling Up to Your First 10 Million Users (ARC205-R1) - AWS re:Invent 2018
Scaling Up to Your First 10 Million Users (ARC205-R1) - AWS re:Invent 2018
Amazon Web Services
 
Scaling Push Messaging for Millions of Netflix Devices
Scaling Push Messaging for Millions of Netflix DevicesScaling Push Messaging for Millions of Netflix Devices
Scaling Push Messaging for Millions of Netflix Devices
Susheel Aroskar
 
Cloud Migration: Cloud Readiness Assessment Case Study
Cloud Migration: Cloud Readiness Assessment Case StudyCloud Migration: Cloud Readiness Assessment Case Study
Cloud Migration: Cloud Readiness Assessment Case Study
CAST
 
Wso2 API Manager Fundamentals
Wso2 API Manager FundamentalsWso2 API Manager Fundamentals
Wso2 API Manager Fundamentals
Rajith Siriwardana
 
Google Cloud Platform Tutorial | GCP Fundamentals | Edureka
Google Cloud Platform Tutorial | GCP Fundamentals | EdurekaGoogle Cloud Platform Tutorial | GCP Fundamentals | Edureka
Google Cloud Platform Tutorial | GCP Fundamentals | Edureka
Edureka!
 
An Overview of Best Practices for Large Scale Migrations - AWS Transformation...
An Overview of Best Practices for Large Scale Migrations - AWS Transformation...An Overview of Best Practices for Large Scale Migrations - AWS Transformation...
An Overview of Best Practices for Large Scale Migrations - AWS Transformation...
Amazon Web Services
 
amazon database
amazon databaseamazon database
amazon database
PrasannaBhalerao3
 

What's hot (20)

Azure Cloud Adoption Framework + Governance - Sana Khan and Jay Kumar
Azure Cloud Adoption Framework + Governance - Sana Khan and Jay Kumar Azure Cloud Adoption Framework + Governance - Sana Khan and Jay Kumar
Azure Cloud Adoption Framework + Governance - Sana Khan and Jay Kumar
 
BDT201 AWS Data Pipeline - AWS re: Invent 2012
BDT201 AWS Data Pipeline - AWS re: Invent 2012BDT201 AWS Data Pipeline - AWS re: Invent 2012
BDT201 AWS Data Pipeline - AWS re: Invent 2012
 
Cloud Streaming Platform @Generali Switzerland
Cloud Streaming Platform @Generali SwitzerlandCloud Streaming Platform @Generali Switzerland
Cloud Streaming Platform @Generali Switzerland
 
Building PCI Compliance Solution on AWS - Pop-up Loft Tel Aviv
Building PCI Compliance Solution on AWS - Pop-up Loft Tel AvivBuilding PCI Compliance Solution on AWS - Pop-up Loft Tel Aviv
Building PCI Compliance Solution on AWS - Pop-up Loft Tel Aviv
 
Overview of API Management Architectures
Overview of API Management ArchitecturesOverview of API Management Architectures
Overview of API Management Architectures
 
Splunk-Presentation
Splunk-Presentation Splunk-Presentation
Splunk-Presentation
 
Cloud Journey Roadmap: Capgemini's Cloud Readiness Assessment
Cloud Journey Roadmap: Capgemini's Cloud Readiness AssessmentCloud Journey Roadmap: Capgemini's Cloud Readiness Assessment
Cloud Journey Roadmap: Capgemini's Cloud Readiness Assessment
 
Cost Optimisation on AWS
Cost Optimisation on AWSCost Optimisation on AWS
Cost Optimisation on AWS
 
Module 1 - AWSome Day Online Conference Thailand
Module 1 - AWSome Day Online Conference Thailand Module 1 - AWSome Day Online Conference Thailand
Module 1 - AWSome Day Online Conference Thailand
 
Ansiblefest 2018 Network automation journey at roblox
Ansiblefest 2018 Network automation journey at robloxAnsiblefest 2018 Network automation journey at roblox
Ansiblefest 2018 Network automation journey at roblox
 
Cloud Migration: A How-To Guide
Cloud Migration: A How-To GuideCloud Migration: A How-To Guide
Cloud Migration: A How-To Guide
 
Agile Testing in the Cloud
Agile Testing in the CloudAgile Testing in the Cloud
Agile Testing in the Cloud
 
API Management - Why it matters!
API Management - Why it matters!API Management - Why it matters!
API Management - Why it matters!
 
Scaling Up to Your First 10 Million Users (ARC205-R1) - AWS re:Invent 2018
Scaling Up to Your First 10 Million Users (ARC205-R1) - AWS re:Invent 2018Scaling Up to Your First 10 Million Users (ARC205-R1) - AWS re:Invent 2018
Scaling Up to Your First 10 Million Users (ARC205-R1) - AWS re:Invent 2018
 
Scaling Push Messaging for Millions of Netflix Devices
Scaling Push Messaging for Millions of Netflix DevicesScaling Push Messaging for Millions of Netflix Devices
Scaling Push Messaging for Millions of Netflix Devices
 
Cloud Migration: Cloud Readiness Assessment Case Study
Cloud Migration: Cloud Readiness Assessment Case StudyCloud Migration: Cloud Readiness Assessment Case Study
Cloud Migration: Cloud Readiness Assessment Case Study
 
Wso2 API Manager Fundamentals
Wso2 API Manager FundamentalsWso2 API Manager Fundamentals
Wso2 API Manager Fundamentals
 
Google Cloud Platform Tutorial | GCP Fundamentals | Edureka
Google Cloud Platform Tutorial | GCP Fundamentals | EdurekaGoogle Cloud Platform Tutorial | GCP Fundamentals | Edureka
Google Cloud Platform Tutorial | GCP Fundamentals | Edureka
 
An Overview of Best Practices for Large Scale Migrations - AWS Transformation...
An Overview of Best Practices for Large Scale Migrations - AWS Transformation...An Overview of Best Practices for Large Scale Migrations - AWS Transformation...
An Overview of Best Practices for Large Scale Migrations - AWS Transformation...
 
amazon database
amazon databaseamazon database
amazon database
 

Viewers also liked

Cloud Security at Netflix
Cloud Security at NetflixCloud Security at Netflix
Cloud Security at Netflix
Jason Chan
 
Yow Conference Dec 2013 Netflix Workshop Slides with Notes
Yow Conference Dec 2013 Netflix Workshop Slides with NotesYow Conference Dec 2013 Netflix Workshop Slides with Notes
Yow Conference Dec 2013 Netflix Workshop Slides with Notes
Adrian Cockcroft
 
Cloud-powered Continuous Integration and Deployment architectures - Jinesh Varia
Cloud-powered Continuous Integration and Deployment architectures - Jinesh VariaCloud-powered Continuous Integration and Deployment architectures - Jinesh Varia
Cloud-powered Continuous Integration and Deployment architectures - Jinesh Varia
Amazon Web Services
 
Dystopia as a Service
Dystopia as a ServiceDystopia as a Service
Dystopia as a Service
Adrian Cockcroft
 
Netflix on Cloud - combined slides for Dev and Ops
Netflix on Cloud - combined slides for Dev and OpsNetflix on Cloud - combined slides for Dev and Ops
Netflix on Cloud - combined slides for Dev and Ops
Adrian Cockcroft
 
Global Netflix Platform
Global Netflix PlatformGlobal Netflix Platform
Global Netflix Platform
Adrian Cockcroft
 
Netflix Architecture Tutorial at Gluecon
Netflix Architecture Tutorial at GlueconNetflix Architecture Tutorial at Gluecon
Netflix Architecture Tutorial at Gluecon
Adrian Cockcroft
 
Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is ...
Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is ...Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is ...
Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is ...
Adrian Cockcroft
 
AWS Re:Invent - High Availability Architecture at Netflix
AWS Re:Invent - High Availability Architecture at NetflixAWS Re:Invent - High Availability Architecture at Netflix
AWS Re:Invent - High Availability Architecture at Netflix
Adrian Cockcroft
 
Netflix Global Cloud Architecture
Netflix Global Cloud ArchitectureNetflix Global Cloud Architecture
Netflix Global Cloud Architecture
Adrian Cockcroft
 
Building Applications with DynamoDB
Building Applications with DynamoDBBuilding Applications with DynamoDB
Building Applications with DynamoDB
Amazon Web Services
 
Microservices Workshop All Topics Deck 2016
Microservices Workshop All Topics Deck 2016Microservices Workshop All Topics Deck 2016
Microservices Workshop All Topics Deck 2016
Adrian Cockcroft
 
Understanding AntiEntropy in Cassandra
Understanding AntiEntropy in CassandraUnderstanding AntiEntropy in Cassandra
Understanding AntiEntropy in Cassandra
Jason Brown
 
When Developers Operate and Operators Develop
When Developers Operate and Operators DevelopWhen Developers Operate and Operators Develop
When Developers Operate and Operators Develop
Adrian Cockcroft
 
Openstack Silicon Valley - Vendor Lock In
Openstack Silicon Valley - Vendor Lock InOpenstack Silicon Valley - Vendor Lock In
Openstack Silicon Valley - Vendor Lock In
Adrian Cockcroft
 
Cloud Trends Nov2015 Structure
Cloud Trends Nov2015 StructureCloud Trends Nov2015 Structure
Cloud Trends Nov2015 Structure
Adrian Cockcroft
 
What's Missing? Microservices Meetup at Cisco
What's Missing? Microservices Meetup at CiscoWhat's Missing? Microservices Meetup at Cisco
What's Missing? Microservices Meetup at Cisco
Adrian Cockcroft
 
Netflix in the Cloud
Netflix in the CloudNetflix in the Cloud
Netflix in the Cloud
Adrian Cockcroft
 
Microxchg Analyzing Response Time Distributions for Microservices
Microxchg Analyzing Response Time Distributions for MicroservicesMicroxchg Analyzing Response Time Distributions for Microservices
Microxchg Analyzing Response Time Distributions for Microservices
Adrian Cockcroft
 
Microservices Application Tracing Standards and Simulators - Adrians at OSCON
Microservices Application Tracing Standards and Simulators - Adrians at OSCONMicroservices Application Tracing Standards and Simulators - Adrians at OSCON
Microservices Application Tracing Standards and Simulators - Adrians at OSCON
Adrian Cockcroft
 

Viewers also liked (20)

Cloud Security at Netflix
Cloud Security at NetflixCloud Security at Netflix
Cloud Security at Netflix
 
Yow Conference Dec 2013 Netflix Workshop Slides with Notes
Yow Conference Dec 2013 Netflix Workshop Slides with NotesYow Conference Dec 2013 Netflix Workshop Slides with Notes
Yow Conference Dec 2013 Netflix Workshop Slides with Notes
 
Cloud-powered Continuous Integration and Deployment architectures - Jinesh Varia
Cloud-powered Continuous Integration and Deployment architectures - Jinesh VariaCloud-powered Continuous Integration and Deployment architectures - Jinesh Varia
Cloud-powered Continuous Integration and Deployment architectures - Jinesh Varia
 
Dystopia as a Service
Dystopia as a ServiceDystopia as a Service
Dystopia as a Service
 
Netflix on Cloud - combined slides for Dev and Ops
Netflix on Cloud - combined slides for Dev and OpsNetflix on Cloud - combined slides for Dev and Ops
Netflix on Cloud - combined slides for Dev and Ops
 
Global Netflix Platform
Global Netflix PlatformGlobal Netflix Platform
Global Netflix Platform
 
Netflix Architecture Tutorial at Gluecon
Netflix Architecture Tutorial at GlueconNetflix Architecture Tutorial at Gluecon
Netflix Architecture Tutorial at Gluecon
 
Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is ...
Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is ...Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is ...
Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is ...
 
AWS Re:Invent - High Availability Architecture at Netflix
AWS Re:Invent - High Availability Architecture at NetflixAWS Re:Invent - High Availability Architecture at Netflix
AWS Re:Invent - High Availability Architecture at Netflix
 
Netflix Global Cloud Architecture
Netflix Global Cloud ArchitectureNetflix Global Cloud Architecture
Netflix Global Cloud Architecture
 
Building Applications with DynamoDB
Building Applications with DynamoDBBuilding Applications with DynamoDB
Building Applications with DynamoDB
 
Microservices Workshop All Topics Deck 2016
Microservices Workshop All Topics Deck 2016Microservices Workshop All Topics Deck 2016
Microservices Workshop All Topics Deck 2016
 
Understanding AntiEntropy in Cassandra
Understanding AntiEntropy in CassandraUnderstanding AntiEntropy in Cassandra
Understanding AntiEntropy in Cassandra
 
When Developers Operate and Operators Develop
When Developers Operate and Operators DevelopWhen Developers Operate and Operators Develop
When Developers Operate and Operators Develop
 
Openstack Silicon Valley - Vendor Lock In
Openstack Silicon Valley - Vendor Lock InOpenstack Silicon Valley - Vendor Lock In
Openstack Silicon Valley - Vendor Lock In
 
Cloud Trends Nov2015 Structure
Cloud Trends Nov2015 StructureCloud Trends Nov2015 Structure
Cloud Trends Nov2015 Structure
 
What's Missing? Microservices Meetup at Cisco
What's Missing? Microservices Meetup at CiscoWhat's Missing? Microservices Meetup at Cisco
What's Missing? Microservices Meetup at Cisco
 
Netflix in the Cloud
Netflix in the CloudNetflix in the Cloud
Netflix in the Cloud
 
Microxchg Analyzing Response Time Distributions for Microservices
Microxchg Analyzing Response Time Distributions for MicroservicesMicroxchg Analyzing Response Time Distributions for Microservices
Microxchg Analyzing Response Time Distributions for Microservices
 
Microservices Application Tracing Standards and Simulators - Adrians at OSCON
Microservices Application Tracing Standards and Simulators - Adrians at OSCONMicroservices Application Tracing Standards and Simulators - Adrians at OSCON
Microservices Application Tracing Standards and Simulators - Adrians at OSCON
 

Similar to Netflix in the Cloud at SV Forum

Netflix Velocity Conference 2011
Netflix Velocity Conference 2011Netflix Velocity Conference 2011
Netflix Velocity Conference 2011
Adrian Cockcroft
 
Cloud Architecture Tutorial - Why and What (1of 3)
Cloud Architecture Tutorial - Why and What (1of 3) Cloud Architecture Tutorial - Why and What (1of 3)
Cloud Architecture Tutorial - Why and What (1of 3)
Adrian Cockcroft
 
Netflix web-adrian-qcon
Netflix web-adrian-qconNetflix web-adrian-qcon
Netflix web-adrian-qcon
Yiwei Ma
 
Netflix in the cloud 2011
Netflix in the cloud 2011Netflix in the cloud 2011
Netflix in the cloud 2011
Adrian Cockcroft
 
Migrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global CassandraMigrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global Cassandra
Adrian Cockcroft
 
A1 keynote oracle_infrastructure_as_a_service_move_any_workload_to_the_cloud
A1 keynote oracle_infrastructure_as_a_service_move_any_workload_to_the_cloudA1 keynote oracle_infrastructure_as_a_service_move_any_workload_to_the_cloud
A1 keynote oracle_infrastructure_as_a_service_move_any_workload_to_the_cloud
Dr. Wilfred Lin (Ph.D.)
 
Netflix keynote-adrian-qcon
Netflix keynote-adrian-qconNetflix keynote-adrian-qcon
Netflix keynote-adrian-qcon
Yiwei Ma
 
Migrating to Public Cloud
Migrating to Public CloudMigrating to Public Cloud
Migrating to Public Cloud
Adrian Cockcroft
 
AWS Cloud Kata | Manila - Getting to Scale on AWS
AWS Cloud Kata | Manila - Getting to Scale on AWSAWS Cloud Kata | Manila - Getting to Scale on AWS
AWS Cloud Kata | Manila - Getting to Scale on AWS
Amazon Web Services
 
Cloud First: New Architecture for New Infrastructure
Cloud First: New Architecture for New InfrastructureCloud First: New Architecture for New Infrastructure
Cloud First: New Architecture for New Infrastructure
Amazon Web Services
 
Moving Viadeo to AWS (2015)
Moving Viadeo to AWS (2015)Moving Viadeo to AWS (2015)
Moving Viadeo to AWS (2015)
Julien SIMON
 
Intro to cloud.pdf
Intro to cloud.pdfIntro to cloud.pdf
Intro to cloud.pdf
SawanBhattacharya
 
re:Invent 2019 CMP320 - How Dropbox leverages hybrid cloud for scale and inno...
re:Invent 2019 CMP320 - How Dropbox leverages hybrid cloud for scale and inno...re:Invent 2019 CMP320 - How Dropbox leverages hybrid cloud for scale and inno...
re:Invent 2019 CMP320 - How Dropbox leverages hybrid cloud for scale and inno...
Anuj Dewangan
 
Aws-What You Need to Know_Simon Elisha
Aws-What You Need to Know_Simon ElishaAws-What You Need to Know_Simon Elisha
Aws-What You Need to Know_Simon Elisha
Helen Rogers
 
OCCIware presentation at EclipseDay in Lyon, November 2017, by Marc Dutoo, Smile
OCCIware presentation at EclipseDay in Lyon, November 2017, by Marc Dutoo, SmileOCCIware presentation at EclipseDay in Lyon, November 2017, by Marc Dutoo, Smile
OCCIware presentation at EclipseDay in Lyon, November 2017, by Marc Dutoo, Smile
OCCIware
 
Model and pilot all cloud layers with OCCIware - Eclipse Day Lyon 2017
Model and pilot all cloud layers with OCCIware - Eclipse Day Lyon 2017Model and pilot all cloud layers with OCCIware - Eclipse Day Lyon 2017
Model and pilot all cloud layers with OCCIware - Eclipse Day Lyon 2017
Marc Dutoo
 
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at DatabricksLessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Databricks
 
O'Reilly Webcast: Architecting Applications For The Cloud
O'Reilly Webcast: Architecting Applications For The CloudO'Reilly Webcast: Architecting Applications For The Cloud
O'Reilly Webcast: Architecting Applications For The Cloud
O'Reilly Media
 
The Best of re:invent 2016
The Best of re:invent 2016The Best of re:invent 2016
The Best of re:invent 2016
Amazon Web Services
 
Create cloud service on AWS
Create cloud service on AWSCreate cloud service on AWS
Create cloud service on AWS
Amazon Web Services
 

Netflix in the Cloud at SV Forum

  • 1. Cloud Architecture at Netflix
       How Netflix Built a Scalable Java oriented PaaS running on AWS
       SVForum March 27th, 2012
       Adrian Cockcroft
       @adrianco #netflixcloud
       http://www.linkedin.com/in/adriancockcroft
  • 2. Adrian Cockcroft
       • Director, Architecture for Cloud Systems, Netflix Inc.
           – Previously Director for Personalization Platform
       • Distinguished Availability Engineer, eBay Inc. 2004-7
           – Founding member of eBay Research Labs
       • Distinguished Engineer, Sun Microsystems Inc. 1988-2004
           – 2003-4 Chief Architect High Performance Technical Computing
           – 2001 Author: Capacity Planning for Web Services
           – 1999 Author: Resource Management
           – 1995 & 1998 Author: Sun Performance and Tuning
           – 1996 Japanese Edition of Sun Performance and Tuning
               • SPARC & Solarisパフォーマンスチューニング (サンソフトプレスシリーズ)
       • More
           – Twitter @adrianco – Blog http://perfcap.blogspot.com
           – Presentations at http://www.slideshare.net/adrianco
  • 3. Why Netflix, Why Cloud, Why AWS
       Part 1 of 3
  • 4. What kind of Cloud?
       • Software as a Service – SaaS
           – Replaces in-house applications
           – Targets end users
       • Platform as a Service – PaaS
           – Replaces in-house operations functions
           – Targets developers
       • Infrastructure as a Service – IaaS
           – Replaces in-house datacenter capacity
           – Targets developers and ITops
  • 5. What Netflix Did
       • Moved to SaaS
           – Corporate IT – OneLogin, Workday, Box, Evernote…
           – Tools – Pagerduty, AppDynamics, Elastic MapReduce
       • Built our own PaaS <- today’s focus
           – Customized to make our developers productive
           – When we started, we had little choice
       • Moved incremental capacity to IaaS
           – No new datacenter space since 2008 as we grew
           – Moved our streaming apps to the cloud
  • 6. Why Use Public Cloud?
  • 9. Data Center
       Netflix could not build new datacenters fast enough
       Capacity growth is accelerating, unpredictable
       Product launch spikes - iPhone, Wii, PS3, Xbox
       International – Canada, Latin America, UK/Ireland
  • 10. Netflix.com is now ~100% Cloud
        A few small back end data sources still in progress
        All international product is cloud based
        USA specific logistics remains in the Datacenter
        Working on SOX, PCI as scope starts to include AWS
  • 11. Netflix Choice was AWS with our own platform and tools
        Unique platform requirements and extreme scale, agility and flexibility
  • 12. Leverage AWS Scale
        “the biggest public cloud”
        AWS investment in features and automation
        Use AWS zones and regions for high availability, scalability and global deployment
  • 13. But isn’t Amazon a competitor?
        Many products that compete with Amazon run on AWS
        We are a “poster child” for the AWS Architecture
        Netflix is one of the biggest AWS customers
        Co-opetition - competitors are also partners
  • 14. Could Netflix use another cloud?
        Would be nice, we use three interchangeable CDN Vendors
        But no-one else has the scale and features of AWS
        You have to be this tall to ride this ride…
        Maybe in 2-3 years?
  • 15. We want to use clouds, we don’t have time to build them
        Public cloud for agility and scale
        We use electricity too, but don’t want to build our own power station…
        AWS because they are big enough to allocate thousands of instances per hour when we need to
  • 16. What about other PaaS?
        • CloudFoundry – Open Source by VMware
            – Developer-friendly, easy to get started
            – Missing scale and some enterprise features
        • Rightscale
            – Widely used to abstract away from AWS
            – Creates its own lock-in problem…
        • AWS is growing into this space
            – We didn’t want a vendor between us and AWS
            – We wanted to build a thin PaaS, that gets thinner
  • 17. Netflix Deployed on AWS
        2009           2009                   2010         2010            2010               2011
        Content        Logs                   Play         WWW             API                CS
        Video Masters  S3                     DRM          Sign-Up         Metadata           International CS lookup
        EC2            EMR Hadoop             CDN routing  Search          Device Config      Diagnostics & Actions
        S3             Hive                   Bookmarks    Movie Choosing  TV Movie Choosing  Customer Call Log
        CDNs           Business Intelligence  Logging      Ratings         Social Facebook    CS Analytics
  • 18. Cloud Architecture Patterns
        Where do we start?
  • 19. Goals
        • Faster
            – Lower latency than the equivalent datacenter web pages and API calls
            – Measured as mean and 99th percentile
            – For both first hit (e.g. home page) and in-session hits for the same user
        • Scalable
            – Avoid needing any more datacenter capacity as subscriber count increases
            – No central vertically scaled databases
            – Leverage AWS elastic capacity effectively
        • Available
            – Substantially higher robustness and availability than datacenter services
            – Leverage multiple AWS availability zones
            – No scheduled down time, no central database schema to change
        • Productive
            – Optimize agility of a large development team with automation and tools
            – Leave behind complex tangled datacenter code base (~8 year old architecture)
            – Enforce clean layered interfaces and re-usable components
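
The "Faster" goal above is measured as mean and 99th percentile latency. A minimal sketch of both calculations follows, assuming latency samples in milliseconds and the nearest-rank definition of a percentile. The class name `LatencyStats` is hypothetical and does not come from the talk.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of any Netflix library.
final class LatencyStats {
    private LatencyStats() {}

    /** Arithmetic mean of the latency samples (milliseconds). */
    static double mean(double[] samplesMs) {
        if (samplesMs.length == 0) throw new IllegalArgumentException("no samples");
        double sum = 0.0;
        for (double s : samplesMs) sum += s;
        return sum / samplesMs.length;
    }

    /** Nearest-rank percentile: the smallest sample with at least p% of samples at or below it. */
    static double percentile(double[] samplesMs, double p) {
        if (samplesMs.length == 0) throw new IllegalArgumentException("no samples");
        if (p <= 0.0 || p > 100.0) throw new IllegalArgumentException("p must be in (0, 100]");
        double[] sorted = samplesMs.clone();      // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0); // 1-based rank
        return sorted[rank - 1];
    }
}
```

The mean hides slow outliers, which is presumably why the slide tracks the 99th percentile alongside it: `LatencyStats.percentile(samples, 99.0)` reports the latency that the slowest 1% of requests exceed.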
  • 20. Datacenter Anti-Patterns
        What do we currently do in the datacenter that prevents us from meeting our goals?
  • 21. Rewrite from Scratch
        Not everything is cloud specific
        Pay down technical debt
        Robust patterns
  • 22. Netflix Datacenter vs. Cloud Arch
        Datacenter                  Cloud
        Central SQL Database        Distributed Key/Value NoSQL
        Sticky In-Memory Session    Shared Memcached Session
        Chatty Protocols            Latency Tolerant Protocols
        Tangled Service Interfaces  Layered Service Interfaces
        Instrumented Code           Instrumented Service Patterns
        Fat Complex Objects         Lightweight Serializable Objects
        Components as Jar Files     Components as Services
  • 23. Software Architecture Patterns
        • Object Models
            – Basic and derived types, facets, serializable
            – Pass by reference within a service
            – Pass by value between services
        • Computation and I/O Models
            – Service Execution using Best Effort / Futures
            – Common thread pool management
            – Circuit breakers to manage and contain failures
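
The last bullet above mentions circuit breakers to contain failures. The slides do not show the platform's implementation, so the following is only a sketch of the general pattern: after a number of consecutive failures the breaker "opens" and returns a fallback without calling the dependency, then allows a trial call once a cool-down has passed. The class name, parameters, and the choice to pass the clock in explicitly are all assumptions made for illustration.

```java
import java.util.function.Supplier;

// Hypothetical illustration of the circuit breaker pattern; not the Netflix platform API.
final class SimpleCircuitBreaker {
    private final int failureThreshold; // consecutive failures that trip the breaker
    private final long openMillis;      // how long the breaker stays open
    private int consecutiveFailures = 0;
    private long openedAt = -1L;        // -1 means the circuit is closed

    SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    /** Runs the call, or returns the fallback while the circuit is open or when the call fails. */
    synchronized <T> T call(Supplier<T> remoteCall, T fallback, long nowMillis) {
        if (isOpen(nowMillis)) {
            return fallback; // open: fail fast without touching the dependency
        }
        try {
            T result = remoteCall.get(); // closed, or a trial call after the cool-down
            consecutiveFailures = 0;
            openedAt = -1L;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                openedAt = nowMillis; // trip (or re-trip) the breaker
            }
            return fallback;
        }
    }

    synchronized boolean isOpen(long nowMillis) {
        return openedAt >= 0 && nowMillis - openedAt < openMillis;
    }
}
```

For simplicity this sketch holds a lock for the duration of the remote call; a production breaker would avoid that and would normally combine the breaker with the thread pool and timeout management listed on the same slide.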
  • 24. Model Driven Architecture
        • Traditional Datacenter Practices
            – Lots of unique hand-tweaked systems
            – Hard to enforce patterns
            – Some use of Puppet to automate changes
        • Model Driven Cloud Architecture
            – Perforce/Ivy/Jenkins based builds for everything
            – Every production instance is a pre-baked AMI
            – Every application is managed by an Autoscaler
        Every change is a new AMI
  • 25. Netflix PaaS Principles
        • Maximum Functionality
            – Developer productivity and agility
        • Leverage as much of AWS as possible
            – AWS is making huge investments in features/scale
        • Interfaces that isolate Apps from AWS
            – Avoid lock-in to specific AWS API details
        • Portability is a long term goal
            – Gets easier as other vendors catch up with AWS
  • 26. Netflix Global PaaS
        • Architecture Features and Overview
        • Portals and Explorers
        • Platform Services
        • Platform APIs
        • Platform Frameworks
        • Persistence
        • Scalability Benchmark
  • 27. Global PaaS?
        Toys are nice, but this is the real thing…
        • Supports all AWS Availability Zones and Regions
        • Supports multiple AWS accounts {test, prod, etc.}
        • Cross Region/Acct Data Replication and Archiving
        • Internationalized, Localized and GeoIP routing
        • Security is fine grain, dynamic AWS keys
        • Autoscaling to thousands of instances
        • Monitoring for millions of metrics
        • Productive for 100s of developers on one product
        • 23M+ users USA, Canada, Latin America, UK, Eire
  • 28. Basic PaaS Entities
        • AWS Based Entities
            – Instances and Machine Images, Elastic IP Addresses
            – Security Groups, Load Balancers, Autoscale Groups
            – Availability Zones and Geographic Regions
        • Netflix PaaS Entities
            – Applications (registered services)
            – Clusters (versioned Autoscale Groups for an App)
            – Properties (dynamic hierarchical configuration)
  • 29. Core PaaS Services
        • AWS Based Services
            – S3 storage, to 5TB files, parallel multipart writes
            – SQS – Simple Queue Service. Messaging layer.
        • Netflix Based Services
            – EVCache – memcached based ephemeral cache
            – Cassandra – distributed data store
        • External Services
            – GeoIP Lookup interfaced to a vendor
            – Keystore HSM in Netflix Datacenter
  • 30. Instance Architecture
        [Diagram; box labels only, layout not recoverable]
        • Linux Base AMI (CentOS or Ubuntu)
        • Optional Apache frontend, memcached, non-java apps
        • Java (JDK 6 or 7)
        • Tomcat
        • AppDynamics appagent
        • Monitoring
        • Log rotation to S3
        • Application servlet, base server, platform, interface jars for dependent services
        • Healthcheck, status servlets, JMX interface, Servo autoscale
        • AppDynamics machineagent
        • GC and thread dump logging
        • Epic
  • 31. Security Architecture
        • Instance Level Security baked into base AMI
            – Login: ssh only allowed via portal (not between instances)
            – Each app type runs as its own userid app{test|prod}
        • AWS Security, Identity and Access Management
            – Each app has its own security group (firewall ports)
            – Fine grain user roles and resource ACLs
        • Key Management
            – AWS Keys dynamically provisioned, easy updates
            – High grade app specific key management support
  • 32. Portals and Explorers
        • Netflix Application Console (NAC)
            – Primary AWS provisioning/config interface
        • AWS Usage Analyzer
            – Breaks down costs by application and resource
        • Cassandra Explorer
            – Browse clusters, keyspaces, column families
        • Base Server Explorer
            – Browse service endpoints configuration, perf
  • 33.
  • 34.
  • 35. Platform Services
        • Discovery – service registry for “Applications”
        • Introspection – Entrypoints
        • Cryptex – Dynamic security key management
        • Geo – Geographic IP lookup
        • Platformservice – Dynamic property configuration
        • Localization – manage and lookup local translations
        • Evcache – ephemeral volatile cache
        • Cassandra – Cross zone/region distributed data store
        • Zookeeper – Distributed Coordination (Curator)
        • Various proxies – access to old datacenter stuff
  • 36. Metrics Framework
        • System and Application
            – Collection, Aggregation, Querying and Reporting
            – Non-blocking logging, avoids log4j lock contention
            – Honu-Streaming -> S3 -> EMR -> Hive
        • Performance, Robustness, Monitoring, Analysis
            – Tracers, Counters – explicit code instrumentation log
            – Real Time Tracers/Counters
            – SLA – service level response time percentiles
            – Servo annotated JMX extract to Cloudwatch
        • Latency Monkey Infrastructure
            – Inject random delays into service responses
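
The Latency Monkey bullet above describes injecting random delays into service responses. The actual tool is not shown in the slides, so this is only a sketch of the idea: a wrapper that delays a configurable fraction of calls by a random amount before running them. The class name, the seeded random source, and both tuning parameters are assumptions.

```java
import java.util.Random;
import java.util.function.Supplier;

// Hypothetical sketch of random latency injection; not the actual Latency Monkey code.
final class LatencyInjector {
    private final Random random;
    private final double injectProbability; // fraction of calls to delay, 0.0 to 1.0
    private final int maxDelayMillis;       // upper bound on an injected delay

    LatencyInjector(long seed, double injectProbability, int maxDelayMillis) {
        this.random = new Random(seed);
        this.injectProbability = injectProbability;
        this.maxDelayMillis = maxDelayMillis;
    }

    /** Picks the delay for the next call: 0 for untouched calls, otherwise 1..maxDelayMillis. */
    synchronized int nextDelayMillis() {
        if (maxDelayMillis <= 0 || random.nextDouble() >= injectProbability) {
            return 0;
        }
        return 1 + random.nextInt(maxDelayMillis);
    }

    /** Sleeps for the chosen delay, then runs the wrapped service call. */
    <T> T call(Supplier<T> service) {
        int delay = nextDelayMillis();
        if (delay > 0) {
            try {
                Thread.sleep(delay);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt status
            }
        }
        return service.get();
    }
}
```

Wrapping a dependency this way lets callers check that their timeouts and fallbacks behave as intended when a service slows down rather than fails outright.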
  • 37. Netflix Platform Persistence
        • Ephemeral Volatile Cache – evcache
            – Discovery-aware memcached based backend
            – Client abstractions for zone aware replication
            – Option to write to all zones, fast read from local
        • Cassandra
            – Highly available and scalable (more later…)
        • MongoDB
            – Complex object/query model for small scale use
        • MySQL
            – Hard to scale, legacy and small relational models
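
The evcache bullets above describe writing to all zones and reading from the local one. A rough sketch of that client-side behaviour follows, using in-memory maps as stand-ins for the per-zone memcached pools. The class and method names are hypothetical, and the real EVCache client API is not shown in the slides.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for per-zone memcached pools; not the EVCache API.
final class ZoneAwareCache {
    private final String localZone;
    private final Map<String, Map<String, String>> storesByZone = new HashMap<>();

    ZoneAwareCache(String localZone, List<String> zones) {
        if (!zones.contains(localZone)) {
            throw new IllegalArgumentException("unknown local zone: " + localZone);
        }
        this.localZone = localZone;
        for (String zone : zones) {
            storesByZone.put(zone, new HashMap<>());
        }
    }

    /** Writes go to every zone so that each zone holds its own copy. */
    void put(String key, String value) {
        for (Map<String, String> store : storesByZone.values()) {
            store.put(key, value);
        }
    }

    /** Reads touch only the local zone; returns null on a miss. */
    String get(String key) {
        return storesByZone.get(localZone).get(key);
    }

    /** Reads a specific zone's copy, for example to inspect replication. */
    String getFromZone(String zone, String key) {
        return storesByZone.get(zone).get(key);
    }

    /** Simulates losing one zone's cache contents. */
    void flushZone(String zone) {
        storesByZone.get(zone).clear();
    }
}
```

The trade-off is extra write traffic in exchange for reads that never cross a zone boundary, and a zone that loses its cache does not take the other zones' copies with it.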
  • 38. Priam – Cassandra Automation   Available at http://github.com/netflix   • Netflix Platform Tomcat Code   • Zero touch auto-configuration   • State management for Cassandra JVM   • Token allocation and assignment   • Broken node auto-replacement   • Full and incremental backup to S3   • Restore sequencing from S3   • Grow/Shrink Cassandra “ring”
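To make "token allocation and assignment" concrete, here is a sketch that spaces tokens evenly around a ring of size 2^127, the token range of Cassandra's RandomPartitioner. The slide does not describe Priam's actual allocation scheme, which may differ (for example in how it handles zones and regions); the class and method names here are hypothetical.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of even token spacing; not Priam's algorithm. */
final class TokenRing {
    /** RandomPartitioner tokens fall in the range [0, 2^127). */
    static final BigInteger RING_SIZE = BigInteger.ONE.shiftLeft(127);

    private TokenRing() {}

    /** Token for node i of n is floor(i * 2^127 / n), giving each node an equal slice. */
    static List<BigInteger> evenTokens(int nodeCount) {
        if (nodeCount <= 0) {
            throw new IllegalArgumentException("nodeCount must be positive");
        }
        List<BigInteger> tokens = new ArrayList<>(nodeCount);
        BigInteger n = BigInteger.valueOf(nodeCount);
        for (int i = 0; i < nodeCount; i++) {
            tokens.add(RING_SIZE.multiply(BigInteger.valueOf(i)).divide(n));
        }
        return tokens;
    }
}
```

With a scheme like this, growing or shrinking the ring changes every node's ideal token, which is one reason to automate assignment rather than edit tokens by hand.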
  • 39. Astyanax   Available at http://github.com/netflix   • Cassandra java client   • API abstraction on top of Thrift protocol   • “Fixed” Connection Pool abstraction (vs. Hector)   – Round robin with Failover   – Retry-able operations not tied to a connection   – Netflix PaaS Discovery service integration   – Host reconnect (fixed interval or exponential backoff)   – Token aware to save a network hop – lower latency   – Latency aware to avoid compacting/repairing nodes – lower variance   • Batch mutation: set, put, delete, increment   • Simplified use of serializers via method overloading (vs. Hector)   • ConnectionPoolMonitor interface for counters and tracers   • Composite Column Names replacing deprecated SuperColumns
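The "host reconnect (fixed interval or exponential backoff)" bullet can be sketched as a delay calculation. This is a generic illustration of capped exponential backoff, not Astyanax's implementation; the class name, parameters and cap are assumptions.

```java
/** Hypothetical capped exponential backoff; not the Astyanax implementation. */
final class ReconnectBackoff {
    private final long baseMillis;
    private final long maxMillis;

    ReconnectBackoff(long baseMillis, long maxMillis) {
        if (baseMillis <= 0 || maxMillis < baseMillis) {
            throw new IllegalArgumentException("need 0 < baseMillis <= maxMillis");
        }
        this.baseMillis = baseMillis;
        this.maxMillis = maxMillis;
    }

    /** Delay before reconnect attempt number 'attempt' (0-based): base * 2^attempt, capped. */
    long delayMillis(int attempt) {
        if (attempt < 0) {
            throw new IllegalArgumentException("attempt must be >= 0");
        }
        if (attempt >= 62) {
            return maxMillis;                 // 2^attempt would overflow a long
        }
        long factor = 1L << attempt;
        if (factor > maxMillis / baseMillis) {
            return maxMillis;                 // base * factor would exceed the cap
        }
        return Math.min(maxMillis, baseMillis * factor);
    }
}
```

A fixed interval is the special case where the delay never grows; backoff reduces reconnect pressure on a host that is still recovering.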
  • 40. Astyanax Query Example   Paginate through all columns in a row
      // Holds the current page of columns returned by the query
      ColumnList<String> columns;
      int pageSize = 10;
      try {
          RowQuery<String, String> query = keyspace
                  .prepareQuery(CF_STANDARD1)
                  .getKey("A")
                  .setIsPaginating()
                  .withColumnRange(new RangeBuilder().setMaxSize(pageSize).build());

          // Each execute() fetches the next page; an empty page ends the loop
          while (!(columns = query.execute().getResult()).isEmpty()) {
              for (Column<String> c : columns) {
                  // process column c here
              }
          }
      } catch (ConnectionException e) {
          // handle or log the connection failure
      }
  • 41. High Availability   • Cassandra stores 3 local copies, 1 per zone   – Synchronous access, durable, highly available   – Read/Write One fastest, least consistent - ~1ms   – Read/Write Quorum 2 of 3, consistent - ~3ms   • AWS Availability Zones   – Separate buildings   – Separate power etc.   – Fairly close together
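The "Quorum 2 of 3" figure follows from the usual majority rule, and the consistency difference between One and Quorum follows from whether read and write replica sets must overlap. The sketch below shows that arithmetic; the class and method names are hypothetical.

```java
/** Hypothetical helper showing quorum arithmetic for a given replication factor. */
final class Quorum {
    private Quorum() {}

    /** A quorum is a strict majority of replicas: floor(rf / 2) + 1. */
    static int quorumSize(int replicationFactor) {
        if (replicationFactor <= 0) {
            throw new IllegalArgumentException("replicationFactor must be positive");
        }
        return replicationFactor / 2 + 1;
    }

    /**
     * If readReplicas + writeReplicas > replicationFactor, every read set
     * shares at least one replica with every write set, so a read sees the latest write.
     */
    static boolean readOverlapsWrite(int replicationFactor, int writeReplicas, int readReplicas) {
        return readReplicas + writeReplicas > replicationFactor;
    }
}
```

With three copies, Quorum reads and writes (2 + 2 > 3) overlap, while One reads and writes (1 + 1 <= 3) do not, which matches the slide's "least consistent" versus "consistent" labels.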
  • 42. “Traditional” Cassandra Write Data Flows   Single Region, Multiple Availability Zone, Not Token Aware   1. Client writes to any Cassandra node   2. Coordinator node replicates to nodes and zones   3. Nodes return ack to coordinator   4. Coordinator returns ack to client   5. Data written to internal commit log disk (no more than 10 seconds later)   • If a node goes offline, hinted handoff completes the write when the node comes back up.   • Requests can choose to wait for one node, a quorum, or all nodes to ack the write.   • SSTable disk writes and compactions occur asynchronously.   [Diagram: non token aware clients writing to six Cassandra nodes with disks, two each in Zones A, B and C; arrows numbered 1 to 5 match the steps above]
  • 43. Astyanax - Cassandra Write Data Flows   Single Region, Multiple Availability Zone, Token Aware   1. Client writes to nodes and zones   2. Nodes return ack to client   3. Data written to internal commit log disks (no more than 10 seconds later)   • If a node goes offline, hinted handoff completes the write when the node comes back up.   • Requests can choose to wait for one node, a quorum, or all nodes to ack the write.   • SSTable disk writes and compactions occur asynchronously.   [Diagram: token aware clients writing to six Cassandra nodes with disks, two each in Zones A, B and C; arrows numbered 1 to 3 match the steps above]
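A token aware client skips the coordinator hop because it can work out for itself which nodes own a key. The sketch below shows the basic ring lookup under simplifying assumptions: tokens are plain long values, replicas are the next nodes clockwise on the ring, and zone placement is ignored. The names are hypothetical and this is not the Astyanax code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical token ring lookup; ignores zone-aware replica placement. */
final class TokenAwareRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    void addNode(long token, String node) {
        ring.put(token, node);
    }

    /**
     * The primary replica is the first node whose token is >= the key's token,
     * wrapping to the start of the ring; further replicas follow clockwise.
     */
    List<String> replicasFor(long keyToken, int replicationFactor) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("ring has no nodes");
        }
        int wanted = Math.min(replicationFactor, ring.size());
        List<String> replicas = new ArrayList<>(wanted);
        Long token = ring.ceilingKey(keyToken);
        if (token == null) {
            token = ring.firstKey();          // wrap around past the highest token
        }
        while (replicas.size() < wanted) {
            replicas.add(ring.get(token));
            token = ring.higherKey(token);
            if (token == null) {
                token = ring.firstKey();
            }
        }
        return replicas;
    }
}
```

A client holding this map can send the write straight to the owning replicas, which is the hop saved between slide 42 and slide 43.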
  • 44. Data Flows for Multi-Region Writes   Token Aware, Consistency Level = Local Quorum   1. Client writes to local replicas   2. Local write acks returned to client, which continues when 2 of 3 local nodes are committed   3. Local coordinator writes to remote coordinator   4. When data arrives, remote coordinator node acks and copies to other remote zones   5. Remote nodes ack to local coordinator   6. Data flushed to internal commit log disks (no more than 10 seconds later)   • If a node or region goes offline, hinted handoff completes the write when the node comes back up.   • Nightly global compare and repair jobs ensure everything stays consistent.   [Diagram: US clients and EU clients, each region with six Cassandra nodes with disks across Zones A, B and C; 100+ms latency between regions; arrows numbered 1 to 6 match the steps above]
  • 45. Cassandra Backup   • Full Backup   – Time based snapshot   – SSTable compress -> S3   • Incremental Backup   – SSTable write triggers compressed copy to S3   • Archive   – Copy cross region   [Diagram: ring of Cassandra nodes copying to S3 Backup]
  • 46. ETL for Cassandra   • Data is de-normalized over many clusters!   • Too many to restore from backups for ETL   • Solution – read backup files using Hadoop   • Aegisthus   – http://techblog.netflix.com/2012/02/aegisthus-bulk-data-pipeline-out-of.html   – High throughput raw SSTable processing   – Re-normalizes many clusters to a consistent view   – Extract, Transform, then Load into Teradata
  • 47. Cassandra Archive   Appropriate level of paranoia needed…   • Archive could be unreadable   – Restore S3 backups weekly from prod to test, and daily ETL   • Archive could be stolen   – PGP Encrypt archive   • AWS East Region could have a problem   – Copy data to AWS West   • Production AWS Account could have an issue   – Separate Archive account with no-delete S3 ACL   • AWS S3 could have a global problem   – Create an extra copy on a different cloud vendor…
  • 48. Tools and Automation   • Developer and Build Tools   – Jira, Perforce, Eclipse, Jenkins, Ivy, Artifactory   – Builds, creates .war file, .rpm, bakes AMI and launches   • Custom Netflix Application Console   – AWS Features at Enterprise Scale (hide the AWS security keys!)   – Auto Scaler Group is unit of deployment to production   • Open Source + Support   – Apache, Tomcat, Cassandra, Hadoop   – Datastax support for Cassandra, AWS support for Hadoop via EMR   • Monitoring Tools   – Alert processing gateway into Pagerduty   – AppDynamics – Developer focus for cloud http://appdynamics.com
  • 49. Open Source Strategy   • Release PaaS Components git-by-git   – Source at github.com/netflix   – Intros and techniques at techblog.netflix.com   – Blog post or new code every week or so   • Motivations   – Give back to Apache licensed OSS community   – Motivate, retain, hire top engineers   – Create a community that adds features and fixes
  • 50. Current OSS Projects and Posts   [Grid of projects, keyed by legend: Github / Techblog, Apache Project, Techblog Post]   Priam, Exhibitor, Servo, Astyanax, Curator, Autoscaling scripts, CassJMeter, Zookeeper, Honu, Cassandra, EVCache, Circuit Breaker, Aegisthus
  • 51. Scalability Testing   • Cloud Based Testing – frictionless, elastic   – Create/destroy any sized cluster in minutes   – Many test scenarios run in parallel   • Test Scenarios   – Internal app specific tests   – Simple “stress” tool provided with Cassandra   • Scale test, keep making the cluster bigger   – Check that tooling and automation works…   – How many ten column row writes/sec can we do?
  • 53. Scale-Up Linearity   http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html   [Chart: Client Writes/s by node count, Replication Factor = 3; x-axis node count from 0 to 350, y-axis writes/s from 0 to 1,200,000; data point labels 174373, 366828, 537172 and 1099837]
  • 55. Chaos Monkey   • Computers (Datacenter or AWS) randomly die   – Fact of life, but too infrequent to test resiliency   • Test to make sure systems are resilient   – Allow any instance to fail without customer impact   • Chaos Monkey hours   – Monday-Thursday 9am-3pm random instance kill   • Application configuration option   – Apps now have to opt-out from Chaos Monkey
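The "Chaos Monkey hours" and opt-out rules can be expressed as a small decision function. This is a sketch of the policy as stated on the slide, not the Chaos Monkey source; the class and method names are hypothetical, and treating the hours as local wall-clock time with 3pm exclusive is an assumption.

```java
import java.time.DayOfWeek;
import java.time.LocalDateTime;
import java.util.List;
import java.util.Optional;
import java.util.Random;

/** Hypothetical sketch of the kill-window policy; not the Chaos Monkey code. */
final class ChaosSchedule {
    private ChaosSchedule() {}

    /** True on Monday through Thursday from 9:00 up to, but not including, 15:00. */
    static boolean inKillWindow(LocalDateTime now) {
        int day = now.getDayOfWeek().getValue();   // Monday = 1 ... Sunday = 7
        boolean allowedDay = day >= DayOfWeek.MONDAY.getValue()
                && day <= DayOfWeek.THURSDAY.getValue();
        int hour = now.getHour();
        return allowedDay && hour >= 9 && hour < 15;
    }

    /** Picks one random instance to kill, unless the app opted out or it is out of hours. */
    static Optional<String> pickVictim(List<String> instances, boolean optedOut,
                                       LocalDateTime now, Random random) {
        if (optedOut || instances.isEmpty() || !inKillWindow(now)) {
            return Optional.empty();
        }
        return Optional.of(instances.get(random.nextInt(instances.size())));
    }
}
```

Restricting kills to these hours presumably keeps induced failures inside the working day, when engineers are on hand to respond.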
  • 56. Responsibility and Experience   • Make developers responsible for failures   – Then they learn and write code that doesn’t fail   • Use Incident Reviews to find gaps to fix   – Make sure it’s not about finding “who to blame”   • Keep timeouts short, fail fast   – Don’t let cascading timeouts stack up   • Make configuration options dynamic   – You don’t want to push code to tweak an option
  • 57. Resilient Design – Circuit Breakers   http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
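The slide only names the circuit breaker pattern and links to the blog post, so the following is a generic sketch of the pattern rather than Netflix's implementation. The class name, the consecutive-failure threshold and the cool-down period are all assumptions; the clock is injected so the behaviour can be exercised without sleeping.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical minimal circuit breaker; not the Netflix implementation. */
final class CircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;
    private final LongSupplier clock;        // assumed to return non-negative milliseconds
    private int consecutiveFailures = 0;
    private long openedAt = -1;              // -1 means the circuit is closed

    CircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    /** Closed circuits allow calls; open circuits allow trial calls once the cool-down has passed. */
    synchronized boolean allowRequest() {
        return openedAt < 0 || clock.getAsLong() - openedAt >= openMillis;
    }

    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        openedAt = -1;                       // close the circuit
    }

    synchronized void recordFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAt = clock.getAsLong();    // open, or re-open after a failed trial
        }
    }

    /** Runs primary if allowed; otherwise, or on failure, fails fast to the fallback. */
    <T> T call(Supplier<T> primary, Supplier<T> fallback) {
        if (!allowRequest()) {
            return fallback.get();
        }
        try {
            T result = primary.get();
            recordSuccess();
            return result;
        } catch (RuntimeException e) {
            recordFailure();
            return fallback.get();
        }
    }
}
```

While the circuit is open, callers get the fallback immediately instead of waiting on a dependency that is known to be failing, which is what stops timeouts from stacking up across services.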
  • 58. PaaS Operational Model   • Developers   – Provision and run their own code in production   – Take turns to be on call if it breaks (pagerduty)   – Configure autoscalers to handle capacity needs   • DevOps and PaaS (aka NoOps)   – DevOps is used to build and run the PaaS   – PaaS constrains Dev to use automation instead   – PaaS puts more responsibility on Dev, with tools
  • 59. What’s Left for Corp IT?   • Corporate Security and Network Management   – Billing and remnants of streaming service back-ends in DC   • Running Netflix’ DVD Business   – Tens of Oracle instances   – Hundreds of MySQL instances   – Thousands of VMWare VMs   – Zabbix, Cacti, Splunk, Puppet   • Employee Productivity   – Building networks and WiFi   – SaaS OneLogin SSO Portal   – Evernote Premium, Safari Online Bookshelf, Dropbox for Teams   – Google Enterprise Apps, Workday HCM/Expense, Box.com   – Many more SaaS migrations coming…   [Figure: Corp WiFi Performance]
  • 60. Implications for IT Operations   • Cloud is run by developer organization   – Product group’s “IT department” is the AWS API and PaaS   – CorpIT handles billing and some security functions   • Cloud capacity is 10x bigger than Datacenter   – Datacenter oriented IT didn’t scale up as we grew   – We moved a few people out of IT to do DevOps for our PaaS   • Traditional IT Roles and Silos are going away   – We don’t have SA, DBA, Storage, Network admins for cloud   – Developers deploy and “run what they wrote” in production
  • 61. Netflix PaaS Organization   Developer Org Reporting into Product Development, not ITops   [Org chart: Netflix Cloud Platform Team]   • Teams: Cloud Ops Reliability Engineering; Architecture; Build Tools and Automation; Platform and Database Engineering; Cloud Performance; Cloud Solutions   • Responsibilities shown under the teams: Perforce, Jenkins, Artifactory, JIRA; Platform jars; Cassandra; Future planning; Benchmarking; Monitoring; Alert Routing; Key store; Security Arch; Monkeys; Incident Lifecycle; Base AMI, Bakery; Zookeeper; JVM GC Tuning; Efficiency; Netflix App Console; Wiresharking; Entrypoints; Cassandra; AWS VPC; PagerDuty; Hyperguard; AWS API; AWS Instances; Powerpoint :)
  • 62. Roadmap for 2012   • Readiness for global Netflix launches   • More resiliency and improved availability   • More automation, orchestration   • “Hardening” the platform   • Lower latency for web services and devices   • Working towards IPv6 support   • More open sourced components
  • 63. Wrap Up   Answer your remaining questions…   What was missing that you wanted to cover?   Next up – Jason Chan on Security Architecture
  • 64. Takeaway   Netflix has built and deployed a scalable global Platform as a Service.   Key components of the Netflix PaaS are being released as Open Source projects so you can build your own custom PaaS.   http://github.com/Netflix   http://techblog.netflix.com   http://slideshare.net/Netflix   http://www.linkedin.com/in/adriancockcroft   @adrianco #netflixcloud