Netflix Velocity Conference 2011
 


Slide deck from June 14th 2011 Velocity Conference workshop presentation

Comments

  • Very good presentation. I would be pleasure if you would like to let me to have a copy of your presentation. Please email to patilprashant777@gmail.com Thanking you!!!
  • 1) Thanks for sharing, and 2) we're working on transcript extraction!
  • thanks for this. great stuff on great a service.
    FYI if you pdf as images you can suppress Slideshare's silly transcript generator.
  • Thanks for sharing -- fascinating stuff
  • I don't know how slideshare generates the transcript, but they need to fix it...

    Netflix Velocity Conference 2011 Presentation Transcript

    • Netflix Cloud Architecture
      Velocity Conference, June 14, 2011
      Adrian Cockcroft
      @adrianco #netflixcloud http://slideshare.net/adrianco
      acockcroft@netflix.com
    • Who, Why, What
      Netflix in the Cloud
      Cloud Challenges and Learnings
      (Ignite) Systems and Operations Architecture
    • Netflix Inc.
      With more than 23 million subscribers in the United States and Canada, Netflix, Inc. is the world’s leading Internet subscription service for enjoying movies and TV shows.
      International Expansion
      We plan to expand into an additional market in the second half of 2011… If the second market meets our expectations… we will continue to invest and expand aggressively in 2012.
      Source: http://ir.netflix.com
    • Unlimited streaming for $7.99/month, large and growing catalog of movies and TV
    • Adrian Cockcroft
      • Director, Architecture for Cloud Systems, Netflix Inc.
          – Previously Director for Personalization Platform
      • Distinguished Availability Engineer, eBay Inc. 2004-7
          – Founding member of eBay Research Labs
      • Distinguished Engineer, Sun Microsystems Inc. 1988-2004
          – 2003-4 Chief Architect High Performance Technical Computing
          – 2001 Author: Capacity Planning for Web Services
          – 1999 Author: Resource Management
          – 1995 & 1998 Author: Sun Performance and Tuning
          – 1996 Japanese Edition of Sun Performance and Tuning (SPARC & Solaris)
    • Why is Netflix Talking about Cloud?
    • Netflix is Path-finding
      The Cloud ecosystem is evolving very fast
      Share with and learn from the cloud community
    • We want to use clouds, not build them
      Cloud technology should be a commodity
      Public cloud and open source for agility and scale
    • Why Use Cloud?
      For Better Business Agility
      For Unpredictable Business Growth
    • Data Center
      Netflix could not build new datacenters fast enough
      Capacity growth is accelerating, unpredictable
      Product launch spikes - iPhone, Wii, PS3, XBox
    • 23 Million Customers
      2011-Q1 year/year customers +69%
      (Chart: customer growth, vertical axis from 0 to 25)
      Source: http://ir.netflix.com
    • Out-Growing Data Center
      http://techblog.netflix.com/2011/02/redesigning-netflix-api.html
      37x Growth Jan 2010-Jan 2011
      (Chart: growth compared with Datacenter Capacity)
    • Netflix.com is now ~100% Cloud
      Account sign-up is currently being moved to cloud
      All international product will be cloud based
      USA specific logistics remains in the Datacenter
    • Leverage AWS Scale
      “the biggest public cloud”
      AWS investment in tooling and automation
      Use many AWS zones for high availability, scalability
      AWS skills are most common on resumes…
    • Leverage AWS Feature Set
      “the market leader”
      EC2, S3, SDB, SQS, EBS, EMR, ELB, ASG, IAM, RDB, VPC…
      http://aws.amazon.com/jp
    • “The cloud lets its users focus on delivering differentiating business value instead of wasting valuable resources on the undifferentiated heavy lifting that makes up most of IT infrastructure.”
      Werner Vogels, Amazon CTO
    • We want to use clouds, we don’t have time to build them
      Public cloud for agility and scale
      AWS because they are big enough to allocate thousands of instances per hour when we need to
    • Netflix EC2 Instances per Account
      (summer 2010, production is much higher now…)
      (Chart: instance counts up to “Many Thousands” over “Several Months”, with series for Content Encoding, Test and Production, and Log Analysis)
    • Netflix Deployed on AWS
      (Diagram: five columns of components)
      • Content: Video Masters, EC2, S3, CDN
      • Logs: S3, EMR Hadoop, Hive, Business Intelligence
      • Play: DRM, CDN routing, Bookmarks, Logging
      • WWW: Sign-Up, Search, Movie Choosing, Ratings
      • API: Metadata, Device Config, TV Movie Choosing, Mobile iPhone
    • Cloud Encoding Pipeline
      (Diagram: Movie Studios Master Tapes, Network Upload, Netflix S3 Master, Encode Mezzanine, S3 Mezzanine, Encode to 50+ files, S3 files, Copy to CDN Origin, CDN Origin, Stream to TV)
      • Licensed content is provided to Netflix as high quality master tapes
      • Many formats are reduced to a single high quality mezzanine format on S3
      • Individual formats and speeds are encoded in over 50 combinations
          – Many formats for older and newer hardware and various game consoles
          – Many speeds from mobile through standard and high definition
      • Static files are copied to each Content Delivery Network’s “origin server”
      • CDNs migrate files to “edge servers” near the end user
      • Files stream to PC/Mac/iPad or TV over HTTP using “range get” to move chunks
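
The “range get” streaming mentioned on the slide above can be illustrated with a small sketch. It only computes the `Range` header values a player could send to fetch a file in fixed-size chunks; the function name and the chunk size are assumptions for illustration, not Netflix's actual player logic.

```python
def chunk_ranges(file_size, chunk_size):
    """Yield HTTP Range header values covering a file in fixed-size chunks.

    Hypothetical helper: byte offsets in a Range header are inclusive,
    so a 4-byte chunk starting at 0 is "bytes=0-3".
    """
    if file_size < 0 or chunk_size <= 0:
        raise ValueError("file_size must be >= 0 and chunk_size > 0")
    start = 0
    while start < file_size:
        end = min(start + chunk_size, file_size) - 1  # inclusive end offset
        yield f"bytes={start}-{end}"
        start = end + 1


# A 10-byte file fetched 4 bytes at a time needs three requests.
ranges = list(chunk_ranges(file_size=10, chunk_size=4))
```

Each value would be sent as the `Range` request header of a separate HTTP GET, which lets a client switch chunk by chunk between the differently encoded files.
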
    • Cloud Architecture
      Ignite!
    • Product Trade-off
      • User Experience: Consistent Experience, Low Latency
      • Implementation: Development complexity, Operational complexity
    • Netflix Cloud Goals
      • Faster
          – Lower latency than the equivalent datacenter web pages and API calls
          – Measured as mean and 99th percentile
          – For both first hit (e.g. home page) and in-session hits for the same user
      • Scalable
          – Avoid needing any more datacenter capacity as subscriber count increases
          – No central vertically scaled databases
          – Leverage AWS elastic capacity effectively
      • Available
          – Substantially higher robustness and availability than datacenter services
          – Leverage multiple AWS availability zones
          – No scheduled down time, no central database schema to change
      • Productive
          – Optimize agility of a large development team with automation and tools
          – Leave behind complex tangled datacenter code base (~8 year old architecture)
          – Enforce clean layered interfaces and re-usable components
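
The “mean and 99th percentile” measurement from the goals above can be sketched as follows. The sample latencies are made up, and the nearest-rank percentile definition is an assumption; the slides do not say which definition was used.

```python
def mean(samples):
    """Arithmetic mean of a non-empty list of latencies."""
    return sum(samples) / len(samples)


def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 0..100 (assumed definition)."""
    ordered = sorted(samples)
    # ceil(pct * n / 100) using integer arithmetic to avoid float rounding
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Made-up data: 98 fast requests, one slower request, one very slow outlier.
latencies_ms = [10] * 98 + [20, 500]

mean_ms = mean(latencies_ms)
p99_ms = percentile(latencies_ms, 99)
worst_ms = percentile(latencies_ms, 100)
```

Reporting both numbers matters because a single outlier moves the mean noticeably while the percentile describes what almost all users experienced.
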
    • Old Datacenter vs. New Cloud Arch
      • Central SQL Database vs. Distributed Key/Value NoSQL
      • Sticky In-Memory Session vs. Shared Memcached Session
      • Chatty Protocols vs. Latency Tolerant Protocols
      • Tangled Service Interfaces vs. Layered Service Interfaces
      • Instrumented Code vs. Instrumented Service Patterns
      • Fat Complex Objects vs. Lightweight Serializable Objects
      • Components as Jar Files vs. Components as Services
    • The Central SQL Database
      • Datacenter has a central database
          – Everything in one place is convenient until it fails
          – Customers, movies, history, configuration
      • Schema changes require downtime
      Anti-pattern impacts scalability, availability
    • The Distributed Key-Value Store
      • Cloud has many key-value data stores
          – More complex to keep track of, do backups etc.
          – Each store is much simpler to administer (image label: DBA)
          – Joins take place in java code
      • No schema to change, no scheduled downtime
      • Latency for Memcached vs. Oracle vs. SimpleDB
          – Memcached is dominated by network latency <1ms
          – Oracle for simple queries is a few milliseconds
          – SimpleDB has replication and REST overheads >10ms
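
To make “joins take place in java code” concrete, here is a minimal sketch of an application-side join over two key-value stores. It is written in Python for brevity, the dictionaries stand in for remote stores, and every name and record in it is hypothetical.

```python
# Two independent key-value stores (plain dicts stand in for remote stores).
ratings_by_customer = {
    "c1": [("m1", 5), ("m2", 3)],
    "c2": [],
}
movies_by_id = {
    "m1": {"title": "Movie One"},
    "m2": {"title": "Movie Two"},
}


def rated_titles(customer_id):
    """Join ratings to movie titles with key lookups instead of a SQL JOIN."""
    joined = []
    for movie_id, stars in ratings_by_customer.get(customer_id, []):
        movie = movies_by_id.get(movie_id)
        if movie is None:
            continue  # tolerate a rating whose movie record is missing
        joined.append((movie["title"], stars))
    return joined


c1_titles = rated_titles("c1")
```

The cost of this design is one lookup per joined row, which is why the per-store latencies listed on the slide matter.
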
    • The Sticky Session
      • Datacenter Sticky Load Balancing
          – Efficient caching for low latency
          – Tricky session handling code
          – Middle tier load balancer has issues in practice
      • Encourages concentrated functionality
          – one service that does everything
      Anti-pattern impacts productivity, availability
    • The Shared Session
      • Cloud Uses Round-Robin Load Balancing
          – Simple request-based code
          – External shared caching with memcached
      • More flexible fine grain services
          – Works better with auto-scaled instance counts
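
A rough sketch of the shared-session idea follows: requests are spread round-robin over stateless instances, and session state lives in one external store. The in-process `SharedSessionStore` class is an assumed stand-in for memcached, and all names are illustrative.

```python
import itertools


class SharedSessionStore:
    """In-process stand-in for an external cache such as memcached (assumption)."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def set(self, key, value):
        self._data[key] = value


class Instance:
    """A stateless service instance: session state is kept only in the store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store
        self.handled = 0

    def handle(self, session_id):
        views = self.store.get(session_id, 0) + 1
        self.store.set(session_id, views)
        self.handled += 1
        return views


store = SharedSessionStore()
instances = [Instance("a", store), Instance("b", store)]
balancer = itertools.cycle(instances)  # round-robin, no session affinity

# Four requests for the same session alternate between the two instances.
results = [next(balancer).handle("session-42") for _ in range(4)]
```

Because no instance holds session state, instances can be added or removed by an autoscaler without losing sessions.
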
    • Chatty Opaque and Brittle Protocols
      • Datacenter service protocols
          – Assumed low latency for many simple requests
      • Based on serializing existing java objects
          – Inefficient formats
          – Incompatible when definitions change
      Anti-pattern causes productivity, latency and availability issues
    • Robust and Flexible Protocols
      • Cloud service protocols
          – JSR311/Jersey is used for REST/HTTP service calls
          – Custom client code includes service discovery
          – Support complex data types in a single request
      • Apache Avro
          – Evolved from Protocol Buffers and Thrift
          – Includes JSON header defining key/value protocol
          – Avro serialization is half the size and several times faster than Java serialization, more work to code
    • Persisted Protocols
      • Persist Avro in Memcached
          – Save space/latency (zigzag encoding, half the size)
          – Less brittle across versions
          – New keys are ignored
          – Missing keys are handled cleanly
      • Avro protocol definitions
          – Can be written in JSON or generated from POJOs
          – It’s hard, needs better tooling
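
The zigzag encoding mentioned above is how Avro's binary format stores `int` and `long` values: the sign is folded into the low bit so that small negative numbers also become short variable-length byte strings. The sketch below is a from-scratch illustration for values in the signed 64-bit range, not a call into the Avro library.

```python
def zigzag_encode(n):
    """Map a signed 64-bit integer to an unsigned one: 0,-1,1,-2 -> 0,1,2,3."""
    return (n << 1) ^ (n >> 63)


def zigzag_decode(z):
    """Inverse of zigzag_encode."""
    return (z >> 1) ^ -(z & 1)


def encode_long(n):
    """Zigzag the value, then emit it as a little-endian base-128 varint."""
    z = zigzag_encode(n)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


small = encode_long(-1)   # one byte
larger = encode_long(64)  # two bytes
```

Values between -64 and 63 fit in a single byte, which is one reason the serialized form is compact for typical counters and identifiers.
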
    • Tangled Service Interfaces
      • Datacenter implementation is exposed
          – Oracle SQL queries mixed into business logic
      • Tangled code
          – Deep dependencies, false sharing
      • Data providers with sideways dependencies
          – Everything depends on everything else
      Anti-pattern affects productivity, availability
    • Untangled Service Interfaces
      • New Cloud Code With Strict Layering
          – Compile against interface jar
          – Can use Spring runtime binding to enforce
      • Service interface is the service
          – Implementation is completely hidden
          – Can be implemented locally or remotely
          – Implementation can evolve independently
    • Untangled Service Interfaces
      Two layers:
      • SAL - Service Access Library
          – Basic serialization and error handling
          – REST or POJOs defined by data provider
      • ESL - Extended Service Library
          – Caching, conveniences
          – Can combine several SALs
          – Exposes faceted type system (described later)
          – Interface defined by data consumer in many cases
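
The two-layer split above might look roughly like this. It is a minimal sketch in Python rather than Java; the `RatingsSAL` and `RatingsESL` classes, their methods, and the dictionary backend are hypothetical and only mirror the responsibilities listed on the slide.

```python
class RatingsSAL:
    """Service Access Library sketch: basic access and error handling only."""

    def __init__(self, backend):
        self._backend = backend  # a dict stands in for the remote service
        self.calls = 0

    def get_ratings(self, customer_id):
        self.calls += 1
        try:
            return dict(self._backend[customer_id])
        except KeyError:
            return {}  # unknown customer is reported as "no ratings"


class RatingsESL:
    """Extended Service Library sketch: caching plus consumer conveniences."""

    def __init__(self, sal):
        self._sal = sal
        self._cache = {}

    def ratings(self, customer_id):
        if customer_id not in self._cache:
            self._cache[customer_id] = self._sal.get_ratings(customer_id)
        return self._cache[customer_id]

    def average_rating(self, customer_id):
        values = list(self.ratings(customer_id).values())
        return sum(values) / len(values) if values else None


sal = RatingsSAL({"c1": {"m1": 5, "m2": 3}})
esl = RatingsESL(sal)
first = esl.average_rating("c1")
second = esl.average_rating("c1")  # served from the ESL cache
```

The consumer only ever talks to the ESL, so the SAL underneath can be swapped without touching calling code.
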
    • Service Interaction Pattern
      Sample Swimlane Diagram
    • Service Architecture Patterns
      • Internal Interfaces Between Services
          – Common patterns as templates
          – Highly instrumented, observable, analytics
          – Service Level Agreements – SLAs
      • Library templates for generic features
          – Instrumented Netflix Base Servlet template
          – Instrumented generic client interface template
          – Instrumented S3, SimpleDB, Memcached clients
    • Service Request Instruments Every Step in the Call
      (Diagram: timestamps recorded on each side of a client-to-service call)
      • CLIENT
          – Request start timestamp, request end timestamp
          – Outbound serialize start timestamp, outbound serialize end timestamp
          – Network send timestamp, network receive timestamp
          – Inbound deserialize start timestamp, inbound deserialize end timestamp
      • SERVICE
          – Network receive timestamp, network send timestamp
          – Inbound serialize start timestamp, inbound serialize end timestamp
          – Execute request start timestamp, execute request end timestamp
          – Outbound serialize start timestamp, outbound serialize end timestamp
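
One way to record a start and end timestamp around every step, as the diagram above describes, is a small tracing helper. This is an illustrative sketch only: the `CallTrace` class and the step names are assumptions, and the "service" work is simulated in-process.

```python
import time
from contextlib import contextmanager


class CallTrace:
    """Collects (step name, start, end) tuples for one request."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.steps = []

    @contextmanager
    def step(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Recorded when the step finishes, so inner steps appear first.
            self.steps.append((name, start, self._clock()))


trace = CallTrace()
with trace.step("client request"):
    with trace.step("client outbound serialize"):
        payload = repr({"customer": "c1"})
    with trace.step("service execute request"):
        response = payload.upper()
    with trace.step("client inbound deserialize"):
        result = response.lower()

step_names = [name for name, _, _ in trace.steps]
```

With per-step timestamps, the difference between the client's request duration and the service's execute duration isolates serialization and network time.
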
    • Boundary Interfaces
      • Isolate teams from external dependencies
          – Fake SAL built by cloud team
          – Real SAL provided by data provider team later
          – ESL built by cloud team using faceted objects
      • Fake data sources allow development to start
          – e.g. Fake Identity SAL for a test set of customers
          – Development solidifies dependencies early
          – Helps external team provide the right interface
    • One Object That Does Everything
      • Datacenter uses a few big complex objects
          – Movie and Customer objects are the foundation
          – Good choice for a small team and one instance
          – Problematic for large teams and many instances
      • False sharing causes tangled dependencies
          – Unproductive re-integration work
      Anti-pattern impacting productivity and availability
    • An Interface For Each Component
      • Cloud uses faceted Video and Visitor
          – Basic types hold only the identifier
          – Facets scope the interface you actually need
          – Each component can define its own facets
      • No false-sharing and dependency chains
          – Type manager converts between facets as needed
          – video.asA(PresentationVideo) for www
          – video.asA(MerchableVideo) for middle tier
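
The faceted-type idea above can be sketched as a basic type that holds only an identifier plus a type manager that builds whichever facet a component asks for. The slides show only the `video.asA(...)` call shape; the fields, the `TypeManager` API, and the lookup tables below are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Video:
    """Basic type: holds only the identifier."""
    video_id: str


@dataclass(frozen=True)
class PresentationVideo:
    """Facet for the web tier (fields are hypothetical)."""
    video_id: str
    title: str


@dataclass(frozen=True)
class MerchableVideo:
    """Facet for the middle tier (fields are hypothetical)."""
    video_id: str
    popularity: float


class TypeManager:
    """Converts a basic Video into a registered facet on request."""

    def __init__(self):
        self._converters = {}

    def register(self, facet_type, converter):
        self._converters[facet_type] = converter

    def as_a(self, video, facet_type):
        try:
            converter = self._converters[facet_type]
        except KeyError:
            raise TypeError(f"no facet registered for {facet_type.__name__}") from None
        return converter(video)


TITLES = {"v1": "Example Title"}
POPULARITY = {"v1": 0.9}

types = TypeManager()
types.register(PresentationVideo, lambda v: PresentationVideo(v.video_id, TITLES[v.video_id]))
types.register(MerchableVideo, lambda v: MerchableVideo(v.video_id, POPULARITY[v.video_id]))

video = Video("v1")
www_view = types.as_a(video, PresentationVideo)
middle_tier_view = types.as_a(video, MerchableVideo)
```

Each component depends only on the facet it registered, so a change to one facet's fields does not ripple into unrelated components.
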
    • Software Architecture Patterns
      • Object Models
          – Basic and derived types, facets, serializable
          – Pass by reference within a service
          – Pass by value between services
      • Computation and I/O Models
          – Service Execution using Best Effort
          – Common thread pool management
    • Netflix Systems Architecture
    • API
      (Architecture diagram; regions: AWS EC2, Netflix Data Center, AWS Storage; labels: Front End Load Balancer, Discovery Service, API Proxy, API etc., Load Balancer, Component Services, API, SQS, Oracle, memcached, Replication, EBS, S3, SimpleDB)
    • Database Migration
      • Why SimpleDB?
          – No DBAs in the cloud, Amazon hosted service
          – Work started two years ago, fewer viable options
          – Worked with Amazon to speed up and scale SimpleDB
      • Alternatives?
          – Rolling out Cassandra as “upgrade” from SimpleDB
          – Need several options to match use cases well
      • Detailed NoSQL and SimpleDB Advice
          – Sid Anand - QConSF Nov 5th – Netflix’ Transition to High Availability Storage Systems
          – Blog - http://practicalcloudcomputing.com/
          – Download Paper PDF - http://bit.ly/bhOTLu
    • Cloud Operations
      Model Driven Architecture
      Capacity Planning & Monitoring
    • Tools and Automation
      • Developer and Build Tools
          – Jira, Perforce, Eclipse, Jenkins, Ivy, Artifactory
          – Builds, creates .war file, .rpm, bakes AMI and launches
      • Custom Netflix Application Console
          – AWS Features at Enterprise Scale (hide the AWS security keys!)
          – Auto Scaler Group is unit of deployment to production
      • Open Source + Support
          – Apache, Tomcat, Cassandra, Hadoop, OpenJDK, CentOS
      • Monitoring Tools
          – Keynote – service monitoring and alerting
          – Custom metric collection and alerting under development
          – Datastax OpsCenter – Cassandra Monitoring
          – AppDynamics – Developer focus for cloud http://appdynamics.com
    • Model Driven Architecture
      • Datacenter Practices
          – Lots of unique hand-tweaked systems
          – Hard to enforce patterns
      • Model Driven Cloud Architecture
          – Perforce/Ivy/Jenkins based builds for everything
          – Every production instance is a pre-baked AMI
          – Every application is managed by an Autoscaler
      Every change is a new AMI
    • High Availability Zones
      • Each zone is a separate datacenter
          – Private power, cooling, network connections
          – Located close together for low latency
      • ASG Instances are distributed over 3 zones
      • Data written to one zone appears in all zones
      • Netflix survived total failure of one zone (!)
          – Increase capacity of existing zones by 50%
          – Small or zero downtime
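
The 50% figure above follows from simple arithmetic: with load spread evenly over three zones, each zone carries one third of it, and after losing a zone each survivor must carry one half, which is 1.5 times its previous share. A short sketch under that even-spread assumption (the function name is made up):

```python
from fractions import Fraction


def extra_capacity_needed(zones, failed=1):
    """Fractional capacity increase each surviving zone needs.

    Assumes load was spread evenly over `zones` and is spread evenly
    over the survivors after `failed` zones are lost.
    """
    if not 0 <= failed < zones:
        raise ValueError("failed must be between 0 and zones - 1")
    return Fraction(zones, zones - failed) - 1


three_zone_increase = extra_capacity_needed(3)  # lose 1 of 3 zones
two_zone_increase = extra_capacity_needed(2)    # lose 1 of 2 zones
```

The same formula shows why three zones are cheaper to protect than two: with only two zones the survivor would need to double its capacity.
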
    • Cross Region Backups
      • Data is backed up into a different cloud region
          – Different AWS S3 account, encrypted for security
          – Additional archives created on a different vendor
      • Restore to a new region
          – Create model driven architecture
          – Send traffic to new region
    • Model Driven Implications
      • Automated “Least Privilege” Security
          – Tightly specified security groups
          – Fine grain IAM keys to access AWS resources
          – Performance tools security and integration
      • Model Driven Performance Monitoring
          – Hundreds of instances appear in a few minutes…
          – Tools have to “garbage collect” dead instances
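
“Garbage collecting” dead instances in a monitoring tool could work along these lines: drop any instance that is no longer in the live set and has not reported within a grace period. The function, the instance ids, and the grace period are all hypothetical.

```python
def prune_dead_instances(last_seen, live_ids, now, grace_seconds):
    """Return the monitoring entries to keep.

    An entry survives if its instance is still live, or if it reported
    recently enough that its data may still be useful for diagnosis.
    """
    return {
        instance_id: seen
        for instance_id, seen in last_seen.items()
        if instance_id in live_ids or now - seen <= grace_seconds
    }


# Made-up timestamps in seconds: i-bbb died long ago, i-ccc died recently.
last_seen = {"i-aaa": 1000, "i-bbb": 400, "i-ccc": 950}
kept = prune_dead_instances(last_seen, live_ids={"i-aaa"}, now=1000, grace_seconds=100)
```

Without this kind of pruning, a fleet that replaces hundreds of instances per deployment would leave the tool cluttered with entries for machines that no longer exist.
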
    • Netflix App Console
    • Auto Scale Group Configuration
    • Learnings
      • Datacenter oriented tools don’t work
          – Ephemeral instances
          – High rate of change
          – Need too much hand-holding and manual setup
      • Many Cloud Tools Don’t Scale for Enterprise
          – Too many tools are “Startup” oriented
          – Built our own tools for 1000’s of instances
          – Drove vendors to be dynamic, scale, add APIs
      • Un-modified Datacenter Apps are Fragile
          – Too many datacenter oriented assumptions
          – We re-wrote our code base!
          – (We re-write it continuously anyway)
    • Capacity Planning & Monitoring
    • Capacity Planning in Clouds
      (a few things have changed…)
      • Capacity is expensive
      • Capacity takes time to buy and provision
      • Capacity only increases, can’t be shrunk easily
      • Capacity comes in big chunks, paid up front
      • Planning errors can cause big problems
      • Systems are clearly defined assets
      • Systems can be instrumented in detail
      • Depreciate assets over 3 years (reservations!)
    • Monitoring Issues
      • Problem
          – Too many tools, each with a good reason to exist
          – Hard to get an integrated view of a problem
          – Too much manual work building dashboards
          – Tools are not discoverable, views are not filtered
      • Solution
          – Get vendors to add deep linking URLs and APIs
          – Integration “portal” ties everything together
          – Underlying dependency database
          – Dynamic portal generation, relevant data, all tools
    • Data Sources
      • External Testing
          – External URL availability and latency alerts and reports – Keynote
          – Stress testing - SOASTA
      • Request Trace Logging
          – Netflix REST calls – Chukwa to DataOven with GUID transaction identifier
          – Generic HTTP – AppDynamics service tier aggregation, end to end tracking
      • Application logging
          – Tracers and counters – log4j, tracer central, Chukwa to DataOven
          – Trackid and Audit/Debug logging – DataOven, AppDynamics GUID cross reference
      • JMX Metrics
          – Application specific real time – Nimsoft, AppDynamics, Epic
          – Service and SLA percentiles – Nimsoft, AppDynamics, Epic, logged to DataOven
      • Tomcat and Apache logs
          – Stdout logs – S3 – DataOven, Nimsoft alerting
          – Standard format Access and Error logs – S3 – DataOven, Nimsoft alerting
      • JVM
          – Garbage Collection – Nimsoft, AppDynamics
          – Memory usage, call stacks, resource/call - AppDynamics
      • Linux
          – System CPU/Net/RAM/Disk metrics – AppDynamics, Epic, Nimsoft alerting
          – SNMP metrics – Epic, Network flows – boundary.com
      • AWS
          – Load balancer traffic – Amazon Cloudwatch, SimpleDB usage stats
          – System configuration - CPU count/speed and RAM size, overall usage - AWS
    • AppDynamics
      How to look deep inside your cloud applications
      • Automatic Monitoring
          – Base AMI bakes in all monitoring tools
          – Outbound calls only – no discovery/polling issues
          – Inactive instances removed after a few days
      • Incident Alarms (deviation from baseline)
          – Business Transaction latency and error rate
          – Alarm thresholds discover their own baseline
          – Email contains URL to Incident Workbench UI
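
An alarm that “discovers its own baseline” can be approximated by comparing each new sample against the mean and spread of recent normal samples. The sketch below is a generic illustration of that idea and makes no claim about how AppDynamics computes its baselines; the window size, threshold, and minimum spread are arbitrary assumptions.

```python
import statistics
from collections import deque


class BaselineAlarm:
    """Flags samples that deviate strongly from a rolling baseline."""

    def __init__(self, window=20, threshold=3.0, min_samples=5, min_spread=1.0):
        self.history = deque(maxlen=window)
        self.threshold = threshold
        self.min_samples = min_samples
        self.min_spread = min_spread  # avoids a zero-width baseline

    def observe(self, value):
        """Return True if the value is an incident; learn from it otherwise."""
        alarm = False
        if len(self.history) >= self.min_samples:
            baseline = statistics.mean(self.history)
            spread = max(statistics.pstdev(self.history), self.min_spread)
            alarm = abs(value - baseline) > self.threshold * spread
        if not alarm:
            self.history.append(value)  # only normal samples update the baseline
        return alarm


alarm = BaselineAlarm()
normal = [alarm.observe(ms) for ms in (100, 102, 98, 101, 99, 100)]
spike = alarm.observe(400)
recovered = alarm.observe(101)
```

No fixed latency threshold is configured anywhere: the same class would adapt to a service whose normal latency is 5 ms or 500 ms.
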
    • Using AppDynamics
      (simple example from early 2010)
    • Point Finger and Assess Impact
      (an async S3 write was slow, no big deal)
    • Monitoring Summary
      • Broken datacenter oriented tools is a big problem
      • Integrating many different tools
          – They are not designed to be integrated
          – We have “persuaded” vendors to add APIs
      • If you can’t see deep inside your app, you’re :-(
    • Wrap Up
    • Implications for IT Operations
      • Cloud is run by developer organization
          – Our IT department is Amazon Cloud
          – Forming “Cloud Operations Reliability Eng” team
      • Traditional IT Roles are going away
          – Don’t need SA, DBA, Storage, Network admins
          – Database Engineering Team runs SDB/Cassandra
    • Next Few Years…
      • “System of Record” moves to Cloud (now)
          – Master copies of data live only in the cloud, with backups
          – Cut the datacenter to cloud replication link, turn off Oracle databases
      • International Expansion – Global Clouds (later in 2011)
          – Rapid deployments to new markets
      • Cloud Standardization?
          – Cloud features and APIs should be a commodity not a differentiator
          – Differentiate on scale and quality of service
          – Competition and scale drives cost down
          – Higher resilience and scalability
      We would prefer to be an insignificant customer in a giant cloud
    • Takeaway
      Netflix is path-finding the use of public AWS cloud to replace in-house IT for non-trivial applications with hundreds of developers and thousands of systems.
      acockcroft@netflix.com
      http://www.linkedin.com/in/adriancockcroft
      @adrianco #netflixcloud
    • Amazon Cloud Terminology
      See http://aws.amazon.com/jp for Japanese
      This is not a full list of Amazon Web Service features
      • AWS – Amazon Web Services (common name for Amazon cloud)
      • AMI – Amazon Machine Image (archived boot disk, Linux, Windows etc. plus application code)
      • EC2 – Elastic Compute Cloud
          – Range of virtual machine types m1, m2, c1, cc, cg. Varying memory, CPU and disk configurations.
          – Instance – a running computer system. Ephemeral, when it is de-allocated nothing is kept.
          – Reserved Instances – pre-paid to reduce cost for long term usage
          – Availability Zone – datacenter with own power and cooling hosting cloud instances
          – Region – group of Availability Zones – US-East, US-West, EU-Eire, Asia-Singapore, Asia-Japan
      • ASG – Auto Scaling Group (instances booting from the same AMI)
      • S3 – Simple Storage Service (http access)
      • EBS – Elastic Block Storage (network disk filesystem can be mounted on an instance)
      • RDB – Relational Data Base (managed MySQL master and slaves)
      • SDB – Simple Data Base (hosted http based NoSQL data store)
      • SQS – Simple Queue Service (http based message queue)
      • SNS – Simple Notification Service (http and email based topics and messages)
      • EMR – Elastic Map Reduce (automatically managed Hadoop cluster)
      • ELB – Elastic Load Balancer
      • EIP – Elastic IP (stable IP address mapping assigned to instance or ELB)
      • VPC – Virtual Private Cloud (extension of enterprise datacenter network into cloud)
      • IAM – Identity and Access Management (fine grain role based security keys)