Cloud Architecture Tutorial - Running in the Cloud (3 of 3)


Part 3 of the talk covers how to transition to the cloud, how to bootstrap developers, how to run cloud services including Cassandra, capacity planning and workload analysis, and organizational structure.


Transcript

  • 1. Cloud Architecture Tutorial - Running in the Cloud. QCon London, March 5th, 2012. Adrian Cockcroft @adrianco #netflixcloud http://www.linkedin.com/in/adriancockcroft Part 3 of 3
  • 2. Running in the Cloud: Bring-up Strategy for Developers and Testing, Capacity Planning and Workloads, Running Cassandra, Monitoring and Scalability, Availability and Resilience, Organizational Structure
  • 3. Cloud Bring-Up Strategy: Simplest and Soonest
  • 4. Shadow Traffic Redirection • Early attempt to send traffic to cloud – Real traffic stream to validate cloud back end – Uncovered lots of process and tools issues – Uncovered service latency issues • TV device calls Datacenter API and Cloud API – Returns genre/movie list for a customer – Asynchronously duplicate request to cloud – Start with send-and-forget mode, ignore response
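The send-and-forget duplication described in that slide can be sketched roughly as follows. This is an illustrative outline only, not Netflix's actual code; the cloud endpoint URL and class names are made-up placeholders.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Shadow traffic redirection sketch: serve the request from the datacenter
    // path as usual, and asynchronously fire a duplicate at the cloud API,
    // ignoring the response (send-and-forget).
    public class ShadowRedirect {
        // Hypothetical cloud endpoint; the real service URL is not given in the talk.
        private static final String CLOUD_API = "http://cloud-genre-service.example.com/genres?customerId=";
        private final HttpClient client = HttpClient.newHttpClient();

        public void shadowGenreRequest(String customerId) {
            HttpRequest duplicate = HttpRequest.newBuilder(URI.create(CLOUD_API + customerId)).GET().build();
            // The CompletableFuture is deliberately discarded, so cloud latency or
            // errors never affect the datacenter response path.
            client.sendAsync(duplicate, HttpResponse.BodyHandlers.discarding());
        }
    }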
  • 5. Shadow Redirect Instances (diagram): modified datacenter service instances send one request per visit to modified cloud service instances; data sources are queueservice and videometadata
  • 6. Video Metadata Server • VMS instance isolates new platform from old codebase – Isolate/unblock cloud team from metadata team schedule – Datacenter code supports obsolete movie object – VMS ESL is designed to support new video facet object • VMS subsets and pre-processes the metadata – Only load data used by cloud services – Fast bulk loads for VMS clients speed startup times – Explore next generation metadata cache architecture. Pattern – Add services to isolate old and new code base
  • 7. First Web Pages in the Cloud
  • 8. First Page • First full page – Basic Genre – Simplest page, no sub-genres, minimal personalization – Lots of investment in new Struts based page design – Uses identity cookie to look up in member info svc • New "merchweb" front end instance – movies.netflix.com points to merchweb instance • Uncovered lots of latency issues – Used memcached to hide S3 and SimpleDB latency – Improved from slower to faster than Datacenter
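Hiding S3 and SimpleDB latency behind memcached is a cache-aside pattern; a minimal sketch using the spymemcached client is below. The key scheme, the 5-minute TTL and the loadFromSimpleDB call are assumptions for illustration, not details from the talk.

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import net.spy.memcached.MemcachedClient;

    // Cache-aside lookup: try memcached first, fall back to the slow backing
    // store (SimpleDB or S3) on a miss, then populate the cache.
    public class GenreCache {
        private final MemcachedClient cache;

        public GenreCache(String memcachedHost) throws IOException {
            this.cache = new MemcachedClient(new InetSocketAddress(memcachedHost, 11211));
        }

        public String getGenreList(String customerId) {
            String key = "genres:" + customerId;          // hypothetical key scheme
            String cached = (String) cache.get(key);
            if (cached != null) {
                return cached;                            // fast path hides backend latency
            }
            String fresh = loadFromSimpleDB(customerId);  // slow path (placeholder)
            cache.set(key, 300, fresh);                   // illustrative 5 minute TTL
            return fresh;
        }

        private String loadFromSimpleDB(String customerId) {
            return "...";                                 // stand-in for the real SimpleDB/S3 read
        }
    }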
  • 9. Genre Page Cloud Instances (diagram): front end merchweb makes multiple requests per visit to the middle tier (genre service plus memcached); data sources are queueservice, rentalhistory and videometadata
  • 10. Controlled Cloud Transition • WWW calling code chooses who goes to cloud – Filter out corner cases, send a percentage of users – The URL that customers see is http://movies.netflix.com/WiContentPage?csid=1 – If problem, redirect to old Datacenter page http://www.netflix.com/WiContentPage?csid=1 • Play Button and Star Rating action redirect – Point URLs for actions that create/modify data back to datacenter to start with
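One common way to implement "send a percentage of users" is a stable hash of the customer id against a dial; a rough sketch follows, with the property name, hash constant and corner-case filter purely illustrative.

    // Decide per customer whether the page request goes to the cloud or the
    // datacenter. Hashing the customer id keeps the decision sticky per user,
    // and the percentage can be dialled to 0 if the cloud page misbehaves.
    public class CloudRollout {

        public boolean goesToCloud(long customerId, boolean isCornerCase) {
            if (isCornerCase) {
                return false;                                               // filter out corner cases
            }
            int percent = Integer.getInteger("cloud.rollout.percent", 0);   // hypothetical property
            long bucket = Math.floorMod(customerId * 2654435761L, 100L);    // stable 0-99 bucket
            return bucket < percent;
        }

        public String pageUrl(long customerId, boolean isCornerCase) {
            return goesToCloud(customerId, isCornerCase)
                    ? "http://movies.netflix.com/WiContentPage?csid=1"
                    : "http://www.netflix.com/WiContentPage?csid=1";
        }
    }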
  • 11. Cloud Development and Testing Issues
  • 12. Boot Camp • One day "Netflix Cloud Training" class – Has been run 6 times for 20-45 people each time • Half day of presentations • Half day hands-on – Create your own hello world app – Launch in AWS test account – Log in to your cloud instances – Find monitoring data on your cloud instances – Connect to Cassandra and read/write data
  • 13. Very First Boot Camp • Pathfinder Bootstrap Mission – Room full of engineers sharing the pain for 1-2 days – Built a very rough prototype working web site • Get everyone hands-on with a new code base – Debug lots of tooling and conceptual issues very fast – Used SimpleDB to create mock data sources • Cloud Specific Key Setup – Needed to integrate with AWS security model – New concepts for datacenter developers
  • 14. Developer Instances Collision (diagram): Sam and Rex both want to deploy the web front end for development, and collide on the single "web" instance in the test account
  • 15. Per-Service Namespace Stack Routing – Developers choose what to share (diagram): Sam uses web-sam with the shared backend-dev, Rex uses web-rex with the shared backend-dev, Mike uses the shared web-dev with his own backend-mike
  • 16. Developer Namespace Stacks • Developer specific service instances – Configured via Java properties at runtime – Routing implemented by REST client library • Server Configuration – Configure discovery service version string – Registers as <appname>-<namespace> • Client Configuration – Route traffic on per-service basis including namespace
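A rough sketch of the property-driven routing this implies: if a developer sets a namespace override for a service, route calls to <appname>-<namespace>, otherwise use the shared stack. The property keys and the -dev default are assumptions for illustration, not the actual Netflix platform configuration.

    // Resolve which discovery name the REST client should call for a service.
    public class NamespaceRouter {

        public String discoveryName(String appName) {
            // e.g. -Dnamespace.backend=mike routes "backend" calls to "backend-mike"
            String ns = System.getProperty("namespace." + appName);       // hypothetical property key
            if (ns == null) {
                ns = System.getProperty("namespace.default", "dev");      // shared default stack
            }
            return appName + "-" + ns;
        }
    }

So with -Dnamespace.web=sam a developer's client resolves web to web-sam while every other service stays on the shared -dev stack, matching the sharing choices shown in the previous slide.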
  • 17. Capacity Planning Metrics and Methods
  • 18. What is Capacity Planning • We care about – CPU, Memory, Network and Disk resource utilization – Application response times and throughput • We need to know – how much of each resource we are using now, and will use in the future – how much headroom we have to handle higher loads • We want to understand – how headroom varies – how it relates to application response times and throughput
  • 19. Capacity Planning Norms • Capacity is expensive • Capacity takes time to buy and provision • Capacity only increases, can't be shrunk easily • Capacity comes in big chunks, paid up front • Planning errors can cause big problems • Systems are clearly defined assets • Systems can be instrumented in detail • Depreciate assets over 3 years
  • 20. Capacity Planning in Clouds (a few things have changed…) • Capacity is expensive • Capacity takes time to buy and provision • Capacity only increases, can't be shrunk easily • Capacity comes in big chunks, paid up front • Planning errors can cause big problems • Systems are clearly defined assets • Systems can be instrumented in detail • Depreciate assets over 3 years (reservations!)
  • 21. Capacity is expensive http://aws.amazon.com/s3/ & http://aws.amazon.com/ec2/ • Storage (Amazon S3) – $0.125 per GB – first 50 TB / month of storage used – $0.055 per GB – storage used / month over 5 PB • Data Transfer (Amazon S3) – $0.000 per GB – all data transfer in is free, first GB out is free – $0.120 per GB – first 10 TB / month data transfer out – $0.050 per GB – data transfer out / month over 350 TB • Requests (Amazon S3 storage access is via HTTP) – $0.01 per 1,000 PUT, COPY, POST, or LIST requests – $0.01 per 10,000 GET and all other requests – $0 per DELETE • CPU (Amazon EC2) – Small (Default) $0.085/hour, Extra Large $0.68/hour, Four XL $2.00/hour – Small (Default) $0.08/hour, Extra Large $0.64/hour, Four XL $1.80/hour • Network (Amazon EC2) – Inbound/Outbound around $0.10 per GB
  • 22. Capacity comes in big chunks, paid up front • Capacity takes time to buy and provision – No minimum price, monthly billing – "Amazon EC2 enables you to increase or decrease capacity within minutes, not hours or days. You can commission one, hundreds or even thousands of server instances simultaneously" • Capacity only increases, can't be shrunk easily – Pay for what is actually used • Planning errors can cause big problems – Size only for what you need now
  • 23. Systems are clearly defined assets • You are running in a "stateless" multi-tenanted virtual image that can die or be taken away and replaced at any time • You don't know exactly where it is, you can choose to locate "US-East" or "Europe" etc. • You can specify zones that will not share components to avoid common mode failures
  • 24. Systems can be instrumented in detail • Each cloud node allocation is unique – So elastic usage patterns keep creating new nodes – "Garbage collect" nodes that won't be seen again – Need to map EIP and Cassandra tokens to instances • Netflix Solution – Entrypoints Slots – Each Autoscale Group has a size – Each instance is given a slot number up to size – Replacements pick empty slots
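The slot idea can be sketched as follows: the group size is fixed, live instances occupy slots 0..size-1, and a replacement claims the lowest empty slot, so EIPs and Cassandra tokens can be keyed by slot rather than by the ephemeral instance id. The class and method names are illustrative, not the Entrypoints implementation.

    import java.util.HashMap;
    import java.util.Map;

    public class SlotAllocator {
        private final int groupSize;                       // autoscale group size
        private final Map<Integer, String> slotToInstance = new HashMap<>();

        public SlotAllocator(int groupSize) {
            this.groupSize = groupSize;
        }

        public synchronized int claimSlot(String instanceId) {
            for (int slot = 0; slot < groupSize; slot++) {
                if (!slotToInstance.containsKey(slot)) {
                    slotToInstance.put(slot, instanceId);
                    return slot;                           // replacement instance reuses this slot
                }
            }
            throw new IllegalStateException("autoscale group is already full");
        }

        public synchronized void releaseSlot(int slot) {
            slotToInstance.remove(slot);                   // "garbage collect" a node that died
        }
    }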
  • 25. Depreciate assets over 3 years (reservations!) • Reduced costs in return for commitment • One or three years, upfront payment • Payment can be depreciated as capital asset • Low, medium or high usage reservations – Save more if you use them more • Spot market instances – Unused reservations sold to other users cheap – Will be yanked at any time if needed
  • 26. A Discussion of Workloads and How They Behave
  • 27. Workload Characteristics • A quick tour through a taxonomy of workload types • Start with the easy ones and work up • Why personalized workloads are different and hard • Some examples and coping strategies
  • 28. Simple Random Arrivals • Random arrival of transactions with fixed mean service time – Little's Law: QueueLength = Throughput * ResponseTime – Utilization Law: Utilization = Throughput * ServiceTime • Complex models are often reduced to this model – By averaging over longer time periods, since the formulas only work if you have stable averages – By wishful thinking (i.e. how to fool yourself)
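A quick worked example of the two laws (the 200 req/s, 50 ms and 4 ms figures are illustrative, not from the talk):

    \text{QueueLength} = X \cdot R = 200\ \text{req/s} \times 0.050\ \text{s} = 10\ \text{requests in flight}
    \text{Utilization} = X \cdot S = 200\ \text{req/s} \times 0.004\ \text{s} = 0.8\ (80\%\ \text{busy})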
  • 29. Mixed random arrivals of transactions with stable mean service times • Think of the grocery store checkout analogy – Trolleys full of shopping vs. baskets full of shopping – Baskets are quick to service, but get stuck behind carts – Relative mixture of transaction types starts to matter • Many transactional systems handle a mixture – Databases, web services • Consider separating fast and slow transactions – So that we have a "10 items or less" line just for baskets – Separate pools of servers for different services – The old rule: don't mix OLTP with DSS queries in databases • Performance is often thread-limited – Thread limit and slow transactions constrain maximum throughput • Model mix using analytical solvers (e.g. PDQ, perfdynamics.com)
  • 30. Load dependent servers – varying mean service times • Mean service time may increase at high throughput – Due to non-scalable algorithms, lock contention – System runs out of memory and starts paging or frequent GC • Mean service time may also decrease at high throughput – Elevator seek and write cancellation optimizations in storage – Load shedding and simplified fallback modes • Systems have "tipping points" if the service time increases – Hysteresis means they don't come back when load drops – This is why you have to kill catatonic systems – Best designs shed load to be stable at the limit – circuit breaker pattern – Practical option is to try to avoid tipping points by reducing variance • Model using discrete event simulation tools – Behaviour is non-linear and hard to model
  • 31. Self-similar / fractal workloads • Bursty rather than random arrival rates • Self-similar – Looks "random" close up, stays "random" as you zoom out – Work arrives in bursts, transactions aren't independent – Bursts cluster together in super-bursts, etc. • Network packet streams tend to be fractal • Common in practice, too hard to model – Probably the most common reason why your model is wrong!
  • 32. State Dependent Service Workloads • Personalized services that store user state/history – Transactions for new users are quick – Transactions for users with lots of state/history are slower – As user base builds state and ages you get into trouble… • Social Networks, Recommendation Services – Facebook, Flickr, Netflix, Twitter etc. • "Abandon hope all ye who enter here" – Not tractable to model, repeatable tests are tricky – Long fat tail response time distribution and timeouts • Try to transform workloads to more tractable forms
  • 33. Example - Twitter Workload • @adrianco tweets – copy to 3600 or so other users • @zoecello tweets many times a day – to over 1M users • @barackobama tweets every few days – to over 12M users • It's the same transaction, but the service time varies by several orders of magnitude • The best (most active and connected = most valuable) users trigger a "denial of service attack" on the systems when they tweet • Cascading effect as many others re-tweet
  • 34. Example - Netflix Movie Choosing • "Pick 24 genres/subgenres etc. of 75 movies each for me" – used by TV based devices like Xbox 360, PS/3, iPhone app • New user – No history of what they have rented (DVD) or streamed – No star ratings for movies, possibly some genre ratings – Basic demographic info – Fast to calculate, easy to find many good choices to return • User with several years tenure – Thousands of movies rented or streamed, "seen it already" – Hundreds to thousands of star ratings, lots of genre ratings – Requests may time out and return fewer or worse choices
  • 35. Workload Modelling Survival Methods • Simplify the workload algorithms – move from hard or impossible to simpler models – decouple, cache and pre-compute to get constant service times • Stand further away – averaging is your friend – gets rid of complex fluctuations • Minimalist Models – most models are far too complex – the classic beginner's error… – the art of modelling is to only model what really matters • Don't model details you don't use – model the peak hour of the week, not day to day fluctuations – e.g. "Will the web site survive next Sunday night?"
  • 36. Running Cassandra
  • 37. Cassandra Use Cases • Key by Customer – Cross-region clusters – Many app specific Cassandra clusters, read-intensive – Keys+rows in memory using m2.4xl instances • Key by Customer:Movie – e.g. Viewing History – Growing fast, write intensive – m1.xl instances – Keys cached in memory, one cluster per region • Large scale data logging – lots of writes – Column data expires after time period – Distributed counters, one cluster per region
  • 38. Netflix Platform Cassandra AMI • Tomcat server with Priam – Always running, registers with platform – Manages Cassandra state, tokens, backups • Removed root disk dependency on EBS – Use S3 backed AMI for stateful services – Normally use EBS backed AMI for fast provisioning
  • 39. Netflix Contributions to Cassandra • Cassandra as a mutable toolkit – Cassandra is in Java, pluggable, well structured – Netflix has a building full of Java engineers… – We changed Cassandra to make it run much better on AWS • Contributions delivered to Cassandra – 0.8: prototype off-heap row cache, SSTable write callback – 1.x: optimizations reduced impact of repair & compaction – January 2012 – Netflix engineer becomes core committer • Cassandra based projects on github.com/Netflix – Priam: AWS integration and backup using Tomcat helper – Astyanax: Java client library – CassJMeter: performance and regression testing
  • 40. Monitoring Tools
  • 41. Monitoring Vision • Problem – Too many tools, each with a good reason to exist – Hard to get an integrated view of a problem – Too much manual work building dashboards – Tools are not discoverable, views are not filtered • Solution – Get vendors to add deep linking and embedding – Integration "portal" ties everything together – Dynamic portal generation, relevant data, all tools
  • 42. Cloud Monitoring Mechanisms • Keynote or Gomez etc. – External URL monitoring • Amazon CloudWatch – Metrics for ELB and instances • AppDynamics – End to end transaction view showing resources used – Powerful real time debug tools for latency, CPU and memory • Epic (Netflix in-house project) – Flexible and easy to use to extend and embed plots • Logs – High capacity logging and analysis framework – Hadoop (log4j -> Honu -> EMR)
  • 43. Using AppDynamics (simple example from early 2010)
  • 44. AppDynamics Monitoring of Cassandra – Automatic Discovery
  • 45. Scalability Testing • Cloud based testing – frictionless, elastic – Create/destroy any sized cluster in minutes – Many test scenarios run in parallel • Test scenarios – Internal app specific tests – Simple "stress" tool provided with Cassandra • Scale test, keep making the cluster bigger – Check that tooling and automation works… – How many ten column row writes/sec can we do?
  • 46. <DrEvil>ONE MILLION</DrEvil>
  • 47. Scale-Up Linearity http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html (chart: client writes/s by node count, replication factor = 3 – 174,373 w/s at 48 nodes, 366,828 at 96, 537,172 at 144, and 1,099,837 at 288 nodes)
  • 48. Stress Client Latency – Includes ~10ms scheduling overhead; for better latency data see http://techblog.netflix.com/2012/03/jmeter-plugin-for-cassandra.html
  • 49. Measured at the Cassandra Server – 3.3 million writes/sec at 0.014 ms (14 microseconds)
  • 50. Per Node Activity (per node, for 48 / 96 / 144 / 288 node clusters):
    Per Server Writes/s:   10,900 w/s / 11,460 w/s / 11,900 w/s / 11,456 w/s
    Mean Server Latency:   0.0117 ms / 0.0134 ms / 0.0148 ms / 0.0139 ms
    Mean CPU %Busy:        74.4 % / 75.4 % / 72.5 % / 81.5 %
    Disk Read:             5,600 KB/s / 4,590 KB/s / 4,060 KB/s / 4,280 KB/s
    Disk Write:            12,800 KB/s / 11,590 KB/s / 10,380 KB/s / 10,080 KB/s
    Network Read:          22,460 KB/s / 23,610 KB/s / 21,390 KB/s / 23,640 KB/s
    Network Write:         18,600 KB/s / 19,600 KB/s / 17,810 KB/s / 19,770 KB/s
    Node specification – Xen virtual images, AWS US East, three zones: Cassandra 0.8.6, CentOS, SunJDK6; AWS EC2 m1 Extra Large – standard price $0.68/hour; 15 GB RAM, 4 cores, 1 Gbit network; 4 internal disks (total 1.6 TB, striped together, md, XFS)
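A quick consistency check between this table, the scale-up chart and the server-side measurement two slides earlier, using only numbers shown in the deck:

    288\ \text{nodes} \times 11{,}456\ \text{w/s} \approx 3.3 \times 10^{6}\ \text{server writes/s}
    1{,}099{,}837\ \text{client w/s} \times 3\ (\text{replication factor}) \approx 3.3 \times 10^{6}\ \text{server writes/s}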
  • 51. Time is Money (48 / 96 / 144 / 288 nodes):
    Writes Capacity:        174,373 w/s / 366,828 w/s / 537,172 w/s / 1,099,837 w/s
    Storage Capacity:       12.8 TB / 25.6 TB / 38.4 TB / 76.8 TB
    Nodes Cost/hr:          $32.64 / $65.28 / $97.92 / $195.84
    Test Driver Instances:  10 / 20 / 30 / 60
    Test Driver Cost/hr:    $20.00 / $40.00 / $60.00 / $120.00
    Cross AZ Traffic (1):   5 TB/hr / 10 TB/hr / 15 TB/hr / 30 TB/hr
    Traffic Cost/10min:     $8.33 / $16.66 / $25.00 / $50.00
    Setup Duration (2):     15 minutes / 22 minutes / 31 minutes / 66 minutes
    AWS Billed Duration:    1 hr / 1 hr / 1 hr / 2 hr
    Total Test Cost:        $60.97 / $121.94 / $182.92 / $561.68
    (1) Estimated as two thirds of total network traffic. (2) Workaround for a tooling bug slowed setup.
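As a sanity check on the totals, the 48-node column is simply the hourly line items for one billed hour plus the traffic charge; the other columns follow the same arithmetic, with the 288-node cluster billed for 2 hours of nodes:

    \$32.64\ (\text{nodes}) + \$20.00\ (\text{test drivers}) = \$52.64\ \text{per billed hour}
    \$52.64 \times 1\ \text{hr} + \$8.33\ (\text{traffic}) = \$60.97\ \text{total test cost}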
  • 52. Availability and Resilience
  • 53. Chaos Monkey • Computers (Datacenter or AWS) randomly die – Fact of life, but too infrequent to test resiliency • Test to make sure systems are resilient – Allow any instance to fail without customer impact • Chaos Monkey hours – Monday-Thursday 9am-3pm random instance kill • Application configuration option – Apps now have to opt out from Chaos Monkey
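In spirit the monkey is little more than the loop sketched below; the CloudApi interface, opt-out lookup and scheduling check are placeholders standing in for the real platform and AWS calls, not the actual Chaos Monkey code.

    import java.time.DayOfWeek;
    import java.time.LocalDateTime;
    import java.util.List;
    import java.util.Random;

    // Minimal Chaos Monkey sketch: during Chaos Monkey hours, pick one random
    // instance from a group that has not opted out and terminate it.
    public class ChaosMonkey {
        // Hypothetical stand-in for the platform/AWS calls actually used.
        interface CloudApi {
            List<String> instancesOf(String autoScalingGroup);
            boolean optedOut(String autoScalingGroup);
            void terminate(String instanceId);
        }

        private final CloudApi cloud;
        private final Random random = new Random();

        public ChaosMonkey(CloudApi cloud) { this.cloud = cloud; }

        public void maybeKill(String group, LocalDateTime now) {
            boolean monkeyHours = now.getDayOfWeek().getValue() <= DayOfWeek.THURSDAY.getValue()
                    && now.getHour() >= 9 && now.getHour() < 15;           // Mon-Thu, 9am-3pm
            if (!monkeyHours || cloud.optedOut(group)) {
                return;                                                    // respect schedule and opt-out
            }
            List<String> instances = cloud.instancesOf(group);
            if (!instances.isEmpty()) {
                cloud.terminate(instances.get(random.nextInt(instances.size())));
            }
        }
    }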
  • 54. Responsibility and Experience • Make developers responsible for failures – Then they learn and write code that doesn't fail • Use incident reviews to find gaps to fix – Make sure it's not about finding "who to blame" • Keep timeouts short, fail fast – Don't let cascading timeouts stack up • Make configuration options dynamic – You don't want to push code to tweak an option
  • 55. Resilient Design – Circuit Breakers http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
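The circuit breaker idea referenced above (fail fast and serve a fallback once a dependency looks unhealthy, then probe again after a cool-off) reduces to something like the toy sketch below; it illustrates the pattern only and is not the Netflix fault-tolerance library described in the linked post.

    import java.util.function.Supplier;

    // Toy circuit breaker: after enough consecutive failures the circuit opens
    // and calls return the fallback immediately (fail fast) until a cool-off
    // period has passed, at which point one trial call is let through.
    public class CircuitBreaker<T> {
        private final int failureThreshold;
        private final long coolOffMillis;
        private int consecutiveFailures = 0;
        private long openedAt = 0;

        public CircuitBreaker(int failureThreshold, long coolOffMillis) {
            this.failureThreshold = failureThreshold;
            this.coolOffMillis = coolOffMillis;
        }

        public synchronized T call(Supplier<T> dependency, Supplier<T> fallback) {
            boolean open = consecutiveFailures >= failureThreshold
                    && System.currentTimeMillis() - openedAt < coolOffMillis;
            if (open) {
                return fallback.get();                 // fail fast, don't touch the sick dependency
            }
            try {
                T result = dependency.get();           // normal (or trial) call
                consecutiveFailures = 0;               // success closes the circuit
                return result;
            } catch (RuntimeException e) {
                consecutiveFailures++;
                openedAt = System.currentTimeMillis();
                return fallback.get();                 // degrade gracefully instead of cascading
            }
        }
    }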
  • 56. PaaS Operational Model - NoOps • Developers – Provision and run their own code in production – Take turns to be on call if it breaks (PagerDuty) – Configure autoscalers to handle capacity needs • Difference between DevOps and NoOps – DevOps is about Dev and Ops working together – NoOps constrains Dev to use automation instead – NoOps puts more responsibility on Dev, with tools
  • 57. Implications for IT Operations • Cloud is run by developer organization – Our IT department is the AWS API – We have no IT staff working on cloud (they do corp IT) • Cloud capacity is 10x bigger than Datacenter – Datacenter oriented IT staffing is flat – We have moved a few people out of IT to write code • Traditional IT roles are going away – Don't need SA, DBA, Storage, Network admins – Developers deploy and run what they wrote in production
  • 58. Netflix "NoOps" Organization (org chart) – Developer org reporting into Product Development, not IT ops. Netflix Cloud Platform Team groups: Cloud Ops Reliability Engineering, Build Tools and Automation, Database Engineering, Platform Development, Cloud Performance, and Cloud Solutions. Responsibilities spread across those teams include monitoring, alert routing, incident lifecycle, PagerDuty, the monkeys, Perforce, Jenkins, Artifactory, JIRA, Base AMI and Bakery, platform jars, key store, Netflix App Console, Entrypoints, Cassandra, Zookeeper, Astyanax, benchmarking, JVM GC tuning, Wiresharking, and the AWS API / AWS instances.
  • 59. Wrap Up – Answer your remaining questions… What was missing that you wanted to cover?
  • 60. Takeaway – Netflix has built and deployed a scalable global Platform as a Service. Key components of the Netflix PaaS are being released as Open Source projects so you can build your own custom PaaS. http://github.com/Netflix http://techblog.netflix.com http://slideshare.net/Netflix http://www.linkedin.com/in/adriancockcroft @adrianco #netflixcloud – End of Part 3 of 3