Cloud Architecture Tutorial – Running in the Cloud (3 of 3)
Part 3 of the talk covers how to transition to the cloud, how to bootstrap developers, how to run cloud services including Cassandra, capacity planning and workload analysis, and organizational structure.

Presentation Transcript

• Cloud Architecture Tutorial – Running in the Cloud
  QCon London, March 5th, 2012
  Adrian Cockcroft – @adrianco #netflixcloud
  http://www.linkedin.com/in/adriancockcroft
  Part 3 of 3
• Running in the Cloud
  – Bring-up Strategy for Developers and Testing
  – Capacity Planning and Workloads
  – Running Cassandra
  – Monitoring and Scalability
  – Availability and Resilience
  – Organizational Structure
• Cloud Bring-Up Strategy – Simplest and Soonest
• Shadow Traffic Redirection
• Early attempt to send traffic to the cloud
  – Real traffic stream to validate the cloud back end
  – Uncovered lots of process and tools issues
  – Uncovered service latency issues
• TV device calls Datacenter API and Cloud API
  – Returns genre/movie list for a customer
  – Asynchronously duplicate the request to the cloud
  – Start with send-and-forget mode, ignore the response
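The send-and-forget duplication on this slide can be sketched as below. This is a minimal illustration, not Netflix's implementation; `call_datacenter` and `call_cloud` are hypothetical stand-ins for the real service calls.

```python
import threading

def call_datacenter(request):
    # The authoritative response still comes from the datacenter service.
    return {"genres": ["Drama", "Comedy"], "source": "datacenter"}

def call_cloud(request):
    # Shadow copy of the request. In send-and-forget mode the response
    # is discarded, so cloud latency or errors never reach the device.
    try:
        pass  # e.g. HTTP POST to the cloud API endpoint would go here
    except Exception:
        pass  # ignore all failures entirely

def handle_device_request(request):
    # Fire the shadow request on a daemon thread and ignore the result,
    # then answer the device from the datacenter as before.
    threading.Thread(target=call_cloud, args=(request,), daemon=True).start()
    return call_datacenter(request)
```

The key property is that the cloud path is exercised by real traffic while remaining invisible to customers.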
• Shadow Redirect Instances (diagram)
  – Modified Datacenter Service Instances and Modified Cloud Service Instances, one request per visit
  – Data sources: queueservice, videometadata
• Video Metadata Server
• VMS instance isolates the new platform from the old codebase
  – Isolate/unblock the cloud team from the metadata team schedule
  – Datacenter code supports the obsolete movie object
  – VMS ESL is designed to support the new video facet object
• VMS subsets and pre-processes the metadata
  – Only load data used by cloud services
  – Fast bulk loads for VMS clients speed startup times
  – Explore next generation metadata cache architecture
  Pattern – Add services to isolate old and new code bases
• First Web Pages in the Cloud
• First Page
• First full page – Basic Genre
  – Simplest page, no sub-genres, minimal personalization
  – Lots of investment in new Struts based page design
  – Uses identity cookie to look up the member info service
• New "merchweb" front end instance
  – movies.netflix.com points to the merchweb instance
• Uncovered lots of latency issues
  – Used memcached to hide S3 and SimpleDB latency
  – Improved from slower to faster than the Datacenter
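Using memcached to hide S3 and SimpleDB latency is the classic cache-aside pattern. A minimal sketch, with a plain dict standing in for a memcached client and a hypothetical `fetch_from_simpledb` standing in for the slow backing store:

```python
# A dict stands in for memcached; a real deployment would use a
# memcached client with the same get/set shape.
cache = {}

def fetch_from_simpledb(key):
    # Stand-in for the slow S3/SimpleDB read in the real system.
    return {"key": key}

def cached_get(key):
    # Cache-aside: serve hits from memory; on a miss, read the slow
    # store once and remember the result for later requests.
    value = cache.get(key)
    if value is None:
        value = fetch_from_simpledb(key)
        cache[key] = value
    return value
```

Only the first request for a key pays the backing-store latency; subsequent requests are served from memory.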
• Genre Page Cloud Instances (diagram)
  – Front end: merchweb
  – Middle tier: genre + memcached, multiple requests per visit
  – Data sources: queueservice, rentalhistory, videometadata
• Controlled Cloud Transition
• WWW calling code chooses who goes to the cloud
  – Filter out corner cases, send a percentage of users
  – The URL that customers see is http://movies.netflix.com/WiContentPage?csid=1
  – If there is a problem, redirect to the old Datacenter page http://www.netflix.com/WiContentPage?csid=1
• Play Button and Star Rating Action redirect
  – Point URLs for actions that create/modify data back to the datacenter to start with
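The percentage-based routing with a datacenter fallback might look like the sketch below. The hashing scheme and `CLOUD_PERCENT` knob are illustrative assumptions, not the actual Netflix code; the two URLs are the ones from the slide.

```python
import hashlib

CLOUD_PERCENT = 10  # portion of eligible users sent to the cloud page

def route_for(customer_id, corner_case=False):
    # Corner cases are filtered out and always stay on the datacenter page.
    if corner_case:
        return "http://www.netflix.com/WiContentPage?csid=1"
    # Hash the customer id into 100 buckets so each user is routed
    # consistently across visits while a fixed percentage goes to cloud.
    bucket = int(hashlib.md5(str(customer_id).encode()).hexdigest(), 16) % 100
    if bucket < CLOUD_PERCENT:
        return "http://movies.netflix.com/WiContentPage?csid=1"
    return "http://www.netflix.com/WiContentPage?csid=1"
```

Raising `CLOUD_PERCENT` gradually shifts traffic; dropping it to 0 is the "redirect back to the datacenter" escape hatch.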
• Cloud Development and Testing Issues
• Boot Camp
• One day "Netflix Cloud Training" class
  – Has been run 6 times for 20–45 people each time
• Half day of presentations
• Half day hands-on
  – Create your own hello world app
  – Launch it in the AWS test account
  – Log in to your cloud instances
  – Find monitoring data on your cloud instances
  – Connect to Cassandra and read/write data
• Very First Boot Camp
• Pathfinder Bootstrap Mission
  – Room full of engineers sharing the pain for 1–2 days
  – Built a very rough prototype working web site
• Get everyone hands-on with a new code base
  – Debug lots of tooling and conceptual issues very fast
  – Used SimpleDB to create mock data sources
• Cloud specific key setup
  – Needed to integrate with the AWS security model
  – New concepts for datacenter developers
• Developer Instances Collision (diagram)
  – Sam and Rex both want to deploy the web front end for development
  – Both target the same "web" service in the test account
• Per-Service Namespace Stack Routing (diagram)
  – Developers choose what to share
  – Sam: web-sam → backend-dev
  – Rex: web-rex → backend-dev
  – Mike: web-dev → backend-mike
• Developer Namespace Stacks
• Developer specific service instances
  – Configured via Java properties at runtime
  – Routing implemented by the REST client library
• Server configuration
  – Configure the discovery service version string
  – Registers as <appname>-<namespace>
• Client configuration
  – Route traffic on a per-service basis, including namespace
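The `<appname>-<namespace>` registration plus per-service client routing can be sketched as follows. The registry contents and the fallback to a shared `dev` namespace are assumptions drawn from the diagram on the previous slide, not the real discovery service.

```python
# Discovery registry keyed by "<appname>-<namespace>", as on the slide.
# Sam has his own web stack but shares the dev backend.
registry = {
    "web-sam": ["10.0.0.1"],
    "backend-dev": ["10.0.0.9"],
}

def resolve(appname, namespace, default_namespace="dev"):
    # Per-service routing: prefer the developer's own namespace stack
    # if it is registered, otherwise fall back to the shared namespace.
    for ns in (namespace, default_namespace):
        instances = registry.get(f"{appname}-{ns}")
        if instances:
            return instances
    raise LookupError(f"no instances registered for {appname}")
```

So Sam's client reaches his private `web-sam` instances but transparently shares `backend-dev` with everyone else.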
• Capacity Planning Metrics and Methods
• What is Capacity Planning?
• We care about
  – CPU, memory, network and disk resource utilization
  – Application response times and throughput
• We need to know
  – how much of each resource we are using now, and will use in the future
  – how much headroom we have to handle higher loads
• We want to understand
  – how headroom varies
  – how it relates to application response times and throughput
• Capacity Planning Norms
• Capacity is expensive
• Capacity takes time to buy and provision
• Capacity only increases, can't be shrunk easily
• Capacity comes in big chunks, paid up front
• Planning errors can cause big problems
• Systems are clearly defined assets
• Systems can be instrumented in detail
• Depreciate assets over 3 years
• Capacity Planning in Clouds (a few things have changed…)
• Capacity is expensive
• Capacity takes time to buy and provision
• Capacity only increases, can't be shrunk easily
• Capacity comes in big chunks, paid up front
• Planning errors can cause big problems
• Systems are clearly defined assets
• Systems can be instrumented in detail
• Depreciate assets over 3 years (reservations!)
• Capacity is expensive
  http://aws.amazon.com/s3/ and http://aws.amazon.com/ec2/
• Storage (Amazon S3)
  – $0.125 per GB – first 50 TB / month of storage used
  – $0.055 per GB – storage used / month over 5 PB
• Data Transfer (Amazon S3)
  – $0.000 per GB – all data transfer in is free, first GB out is free
  – $0.120 per GB – first 10 TB / month data transfer out
  – $0.050 per GB – data transfer out / month over 350 TB
• Requests (Amazon S3 storage access is via HTTP)
  – $0.01 per 1,000 PUT, COPY, POST, or LIST requests
  – $0.01 per 10,000 GET and all other requests
  – $0 per DELETE
• CPU (Amazon EC2)
  – Small (default) $0.085/hour, Extra Large $0.68/hour, Four XL $2.00/hour
  – Small (default) $0.08/hour, Extra Large $0.64/hour, Four XL $1.80/hour
• Network (Amazon EC2)
  – Inbound/outbound around $0.10 per GB
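As a worked example of these list prices, here is a rough monthly bill for a made-up footprint (10 TB in S3 and 100 small EC2 instances, both within the first pricing tier; real bills are tiered and include transfer and request charges):

```python
# Rough monthly cost using the 2012 S3/EC2 list prices quoted above.
GB_PER_TB = 1024
HOURS_PER_MONTH = 720   # 30-day month

s3_storage_tb = 10                              # within the first 50 TB tier
s3_cost = s3_storage_tb * GB_PER_TB * 0.125     # $0.125/GB-month

ec2_small = 100                                 # on-demand small instances
ec2_cost = ec2_small * 0.085 * HOURS_PER_MONTH  # $0.085/hour each

total = s3_cost + ec2_cost                      # about $7,400/month
```

The point of the slide stands: at this scale the per-unit prices, not up-front purchases, dominate the plan.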
• Capacity comes in big chunks, paid up front
• Capacity takes time to buy and provision
  – No minimum price, monthly billing
  – "Amazon EC2 enables you to increase or decrease capacity within minutes, not hours or days. You can commission one, hundreds or even thousands of server instances simultaneously"
• Capacity only increases, can't be shrunk easily
  – Pay for what is actually used
• Planning errors can cause big problems
  – Size only for what you need now
• Systems are clearly defined assets
• You are running in a "stateless" multi-tenanted virtual image that can die or be taken away and replaced at any time
• You don't know exactly where it is; you can choose to locate it in "US-East" or "Europe" etc.
• You can specify zones that will not share components, to avoid common mode failures
• Systems can be instrumented in detail
• Each cloud node allocation is unique
  – So elastic usage patterns keep creating new nodes
  – "Garbage collect" nodes that won't be seen again
  – Need to map EIPs and Cassandra tokens to instances
• Netflix solution – Entrypoints slots
  – Each autoscale group has a size
  – Each instance is given a slot number up to that size
  – Replacements pick empty slots
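The slot scheme gives replacement instances stable identities for monitoring. A minimal sketch of the "replacements pick empty slots" rule (the function name and shape are illustrative, not the Entrypoints API):

```python
def assign_slot(group_size, taken_slots):
    # Each autoscale group has `group_size` slots, numbered from 0.
    # A replacement instance claims the lowest empty slot, so the
    # slot identity (and any EIP/token mapped to it) is reused rather
    # than creating an ever-growing set of unique node names.
    for slot in range(group_size):
        if slot not in taken_slots:
            return slot
    raise RuntimeError("autoscale group is full")
```

This keeps the set of monitored identities bounded by group size, however many instances churn through them.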
• Depreciate assets over 3 years (reservations!)
• Reduced costs in return for commitment
• One or three years, upfront payment
• Payment can be depreciated as a capital asset
• Low, medium or high usage reservations
  – Save more if you use them more
• Spot market instances
  – Unused reservations sold to other users cheaply
  – Will be yanked at any time if needed
• A Discussion of Workloads and How They Behave
• Workload Characteristics
  – A quick tour through a taxonomy of workload types
  – Start with the easy ones and work up
  – Why personalized workloads are different and hard
  – Some examples and coping strategies
  3/12/12 Slide 176
• Simple Random Arrivals
• Random arrival of transactions with a fixed mean service time
  – Little's Law: QueueLength = Throughput × ResponseTime
  – Utilization Law: Utilization = Throughput × ServiceTime
• Complex models are often reduced to this model
  – By averaging over longer time periods, since the formulas only work if you have stable averages
  – By wishful thinking (i.e. how to fool yourself)
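A quick worked example of the two laws on this slide, with made-up numbers (200 requests/s, 50 ms mean response time, 5 ms mean service time on a single server):

```python
# Little's Law and the Utilization Law with illustrative numbers.
throughput = 200.0      # transactions per second
response_time = 0.050   # mean response time in seconds
service_time = 0.005    # mean service time in seconds

queue_length = throughput * response_time  # Little's Law: 10 requests in flight
utilization = throughput * service_time    # Utilization Law: 1.0 = fully busy
```

Here utilization comes out at exactly 1.0, i.e. this server is saturated; Little's Law says the 45 ms difference between response and service time is queueing, with 10 requests in the system on average.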
• Mixed random arrivals of transactions with stable mean service times
• Think of the grocery store checkout analogy
  – Trolleys full of shopping vs. baskets full of shopping
  – Baskets are quick to service, but get stuck behind carts
  – The relative mixture of transaction types starts to matter
• Many transactional systems handle a mixture
  – Databases, web services
• Consider separating fast and slow transactions
  – So that we have a "10 items or less" line just for baskets
  – Separate pools of servers for different services
  – The old rule – don't mix OLTP with DSS queries in databases
• Performance is often thread-limited
  – The thread limit and slow transactions constrain maximum throughput
• Model the mix using analytical solvers (e.g. PDQ, perfdynamics.com)
• Load dependent servers – varying mean service times
• Mean service time may increase at high throughput
  – Due to non-scalable algorithms, lock contention
  – The system runs out of memory and starts paging or frequent GC
• Mean service time may also decrease at high throughput
  – Elevator seek and write cancellation optimizations in storage
  – Load shedding and simplified fallback modes
• Systems have "tipping points" if the service time increases
  – Hysteresis means they don't come back when load drops
  – This is why you have to kill catatonic systems
  – The best designs shed load to be stable at the limit – circuit breaker pattern
  – The practical option is to try to avoid tipping points by reducing variance
• Model using discrete event simulation tools
  – Behaviour is non-linear and hard to model
• Self-similar / fractal workloads
• Bursty rather than random arrival rates
• Self-similar
  – Looks "random" close up, stays "random" as you zoom out
  – Work arrives in bursts, transactions aren't independent
  – Bursts cluster together in super-bursts, etc.
• Network packet streams tend to be fractal
• Common in practice, too hard to model
  – Probably the most common reason why your model is wrong!
• State Dependent Service Workloads
• Personalized services that store user state/history
  – Transactions for new users are quick
  – Transactions for users with lots of state/history are slower
  – As the user base builds state and ages, you get into trouble…
• Social networks, recommendation services
  – Facebook, Flickr, Netflix, Twitter etc.
• "Abandon hope all ye who enter here"
  – Not tractable to model, repeatable tests are tricky
  – Long fat tail response time distribution and timeouts
• Try to transform workloads to more tractable forms
• Example – Twitter Workload
• @adrianco tweets – copy to 3,600 or so other users
• @zoecello tweets many times a day – to over 1M users
• @barackobama tweets every few days – to over 12M users
• It's the same transaction, but the service time varies by several orders of magnitude
• The best (most active and connected = most valuable) users trigger a "denial of service attack" on the systems when they tweet
• Cascading effect as many others re-tweet
• Example – Netflix Movie Choosing
• "Pick 24 genres/subgenres etc. of 75 movies each for me"
  – Used by TV based devices like Xbox 360, PS3, iPhone app
• New user
  – No history of what they have rented (DVD) or streamed
  – No star ratings for movies, possibly some genre ratings
  – Basic demographic info
  – Fast to calculate, easy to find many good choices to return
• User with several years' tenure
  – Thousands of movies rented or streamed, "seen it already"
  – Hundreds to thousands of star ratings, lots of genre ratings
  – Requests may time out and return fewer or worse choices
• Workload Modelling Survival Methods
• Simplify the workload algorithms
  – Move from hard or impossible to simpler models
  – Decouple, cache and pre-compute to get constant service times
• Stand further away
  – Averaging is your friend – it gets rid of complex fluctuations
• Minimalist models
  – Most models are far too complex – the classic beginner's error…
  – The art of modelling is to only model what really matters
• Don't model details you don't use
  – Model the peak hour of the week, not day to day fluctuations
  – e.g. "Will the web site survive next Sunday night?"
• Running Cassandra
• Cassandra Use Cases
• Key by Customer – cross-region clusters
  – Many app specific Cassandra clusters, read-intensive
  – Keys+rows in memory using m2.4xl instances
• Key by Customer:Movie – e.g. Viewing History
  – Growing fast, write intensive – m1.xl instances
  – Keys cached in memory, one cluster per region
• Large scale data logging – lots of writes
  – Column data expires after a time period
  – Distributed counters, one cluster per region
• Netflix Platform Cassandra AMI
• Tomcat server with Priam
  – Always running, registers with the platform
  – Manages Cassandra state, tokens, backups
• Removed root disk dependency on EBS
  – Use S3 backed AMI for stateful services
  – Normally use EBS backed AMI for fast provisioning
• Netflix Contributions to Cassandra
• Cassandra as a mutable toolkit
  – Cassandra is in Java, pluggable, well structured
  – Netflix has a building full of Java engineers…
  – We changed Cassandra to make it run much better on AWS
• Contributions delivered to Cassandra
  – 0.8: prototype off-heap row cache, SSTable write callback
  – 1.x: optimizations reduced the impact of repair & compaction
  – January 2012 – a Netflix engineer becomes a core committer
• Cassandra based projects on github.com/Netflix
  – Priam: AWS integration and backup using a Tomcat helper
  – Astyanax: Java client library
  – CassJMeter: for performance and regression testing
• Monitoring Tools
• Monitoring Vision
• Problem
  – Too many tools, each with a good reason to exist
  – Hard to get an integrated view of a problem
  – Too much manual work building dashboards
  – Tools are not discoverable, views are not filtered
• Solution
  – Get vendors to add deep linking and embedding
  – An integration "portal" ties everything together
  – Dynamic portal generation, relevant data, all tools
• Cloud Monitoring Mechanisms
• Keynote or Gomez etc.
  – External URL monitoring
• Amazon CloudWatch
  – Metrics for ELB and instances
• AppDynamics
  – End to end transaction view showing resources used
  – Powerful real time debug tools for latency, CPU and memory
• Epic (Netflix in-house project)
  – Flexible and easy to use to extend and embed plots
• Logs
  – High capacity logging and analysis framework
  – Hadoop (log4j -> Honu -> EMR)
• Using AppDynamics (simple example from early 2010) [screenshot]
• AppDynamics Monitoring of Cassandra – Automatic Discovery [screenshot]
• Scalability Testing
• Cloud based testing – frictionless, elastic
  – Create/destroy any sized cluster in minutes
  – Many test scenarios run in parallel
• Test scenarios
  – Internal app specific tests
  – Simple "stress" tool provided with Cassandra
• Scale test, keep making the cluster bigger
  – Check that tooling and automation works…
  – How many ten column row writes/sec can we do?
    • <DrEvil>ONE  MILLION</DrEvil>  
• Scale-Up Linearity
  http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
  [chart] Client writes/s by node count – replication factor = 3:
  48 nodes → 174,373 w/s; 96 nodes → 366,828 w/s; 144 nodes → 537,172 w/s; 288 nodes → 1,099,837 w/s
• Stress Client Latency [chart]
  Includes ~10 ms scheduling overhead – for better latency data see
  http://techblog.netflix.com/2012/03/jmeter-plugin-for-cassandra.html
• Measured at the Cassandra Server [chart]
  3.3 million writes/sec at 0.014 ms (14 microseconds)
• Per Node Activity

  Per Node              48 Nodes      96 Nodes      144 Nodes     288 Nodes
  Per Server Writes/s   10,900 w/s    11,460 w/s    11,900 w/s    11,456 w/s
  Mean Server Latency   0.0117 ms     0.0134 ms     0.0148 ms     0.0139 ms
  Mean CPU %Busy        74.4 %        75.4 %        72.5 %        81.5 %
  Disk Read             5,600 KB/s    4,590 KB/s    4,060 KB/s    4,280 KB/s
  Disk Write            12,800 KB/s   11,590 KB/s   10,380 KB/s   10,080 KB/s
  Network Read          22,460 KB/s   23,610 KB/s   21,390 KB/s   23,640 KB/s
  Network Write         18,600 KB/s   19,600 KB/s   17,810 KB/s   19,770 KB/s

  Node specification – Xen virtual images, AWS US East, three zones
  – Cassandra 0.8.6, CentOS, Sun JDK 6
  – AWS EC2 m1 Extra Large – standard price $0.68/hour
  – 15 GB RAM, 4 cores, 1 Gbit network
  – 4 internal disks (total 1.6 TB, striped together, md, XFS)
• Time is Money

                          48 nodes      96 nodes      144 nodes     288 nodes
  Writes Capacity         174,373 w/s   366,828 w/s   537,172 w/s   1,099,837 w/s
  Storage Capacity        12.8 TB       25.6 TB       38.4 TB       76.8 TB
  Nodes Cost/hr           $32.64        $65.28        $97.92        $195.84
  Test Driver Instances   10            20            30            60
  Test Driver Cost/hr     $20.00        $40.00        $60.00        $120.00
  Cross AZ Traffic        5 TB/hr       10 TB/hr      15 TB/hr      30 TB/hr ¹
  Traffic Cost/10min      $8.33         $16.66        $25.00        $50.00
  Setup Duration          15 minutes    22 minutes    31 minutes    66 minutes ²
  AWS Billed Duration     1 hr          1 hr          1 hr          2 hr
  Total Test Cost         $60.97        $121.94       $182.92       $561.68

  ¹ Estimate: two thirds of total network traffic
  ² A workaround for a tooling bug slowed setup
• Availability and Resilience
• Chaos Monkey
• Computers (Datacenter or AWS) randomly die
  – A fact of life, but too infrequent to test resiliency
• Test to make sure systems are resilient
  – Allow any instance to fail without customer impact
• Chaos Monkey hours
  – Monday–Thursday 9am–3pm random instance kill
• Application configuration option
  – Apps now have to opt out from Chaos Monkey
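The victim-selection rule described above can be sketched as below. This is an illustration of the policy on the slide, not the Chaos Monkey codebase; the function signature and data shapes are assumptions.

```python
import random

def pick_victim(instances, opted_out, now_hour, now_weekday):
    # Kill only in Chaos Monkey hours (Mon-Thu 9am-3pm, weekday 0-3),
    # so engineers are around to respond to any fallout.
    if now_weekday > 3 or not (9 <= now_hour < 15):
        return None
    # Apps must explicitly opt out; everything else is fair game.
    candidates = [i for i in instances if i["app"] not in opted_out]
    return random.choice(candidates) if candidates else None
```

Making opt-out (rather than opt-in) the default is the key policy choice: resilience testing happens unless a team argues its way out.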
• Responsibility and Experience
• Make developers responsible for failures
  – Then they learn and write code that doesn't fail
• Use incident reviews to find gaps to fix
  – Make sure it's not about finding "who to blame"
• Keep timeouts short, fail fast
  – Don't let cascading timeouts stack up
• Make configuration options dynamic
  – You don't want to push code to tweak an option
• Resilient Design – Circuit Breakers
  http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
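A minimal sketch of the circuit breaker pattern referenced here (the linked techblog post describes Netflix's real design; this toy version only shows the core idea — fail fast to a fallback after repeated failures, then retry after a cooldown):

```python
import time

class CircuitBreaker:
    # After `threshold` consecutive failures the breaker opens and
    # calls return the fallback immediately, without touching the
    # backend, until `reset_after` seconds have passed.
    def __init__(self, threshold=3, reset_after=30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after:
                return fallback()      # open: fail fast, no backend call
            self.opened_at = None      # half-open: let one call through
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.time()
            return fallback()
        self.failures = 0              # success closes the breaker
        return result
```

This ties together the earlier advice: short timeouts plus a breaker keep a sick dependency from stacking up cascading timeouts, and the system stays stable (in degraded fallback mode) at the tipping point instead of going catatonic.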
• PaaS Operational Model – NoOps
• Developers
  – Provision and run their own code in production
  – Take turns to be on call if it breaks (PagerDuty)
  – Configure autoscalers to handle capacity needs
• Difference between DevOps and NoOps
  – DevOps is about Dev and Ops working together
  – NoOps constrains Dev to use automation instead
  – NoOps puts more responsibility on Dev, with tools
• Implications for IT Operations
• Cloud is run by the developer organization
  – Our IT department is the AWS API
  – We have no IT staff working on cloud (they do corporate IT)
• Cloud capacity is 10x bigger than the Datacenter
  – Datacenter oriented IT staffing is flat
  – We have moved a few people out of IT to write code
• Traditional IT roles are going away
  – Don't need SA, DBA, storage, network admins
  – Developers deploy and run what they wrote in production
• Netflix "NoOps" Organization (org chart)
  Developer org reporting into Product Development, not IT Ops.
  Netflix Cloud Platform Team groups: Cloud Ops Reliability Engineering, Build Tools and Automation, Database Engineering, Platform Development, Cloud Performance, Cloud Solutions.
  Tools and responsibilities shown include: Perforce, Jenkins, JIRA, Artifactory, Base AMI, Bakery, Netflix App Console, platform jars, key store, Cassandra, Zookeeper, Entrypoints, Astyanax, monitoring, alert routing, the monkeys, incident lifecycle, benchmarking, JVM GC tuning, Wiresharking, PagerDuty, AWS instances and the AWS API.
• Wrap Up
  Answer your remaining questions…
  What was missing that you wanted to cover?
• Takeaway
  Netflix has built and deployed a scalable global Platform as a Service. Key components of the Netflix PaaS are being released as open source projects so you can build your own custom PaaS.
  http://github.com/Netflix
  http://techblog.netflix.com
  http://slideshare.net/Netflix
  http://www.linkedin.com/in/adriancockcroft
  @adrianco #netflixcloud
  End of Part 3 of 3