Performance architecture for cloud connect



Slide deck that kicked off the performance day at Cloud Connect March 2011

Published in: Technology

  1. Performance Architecture for Cloud
     March 7, 2011
     Adrian Cockcroft
     @adrianco #netflixcloud #ccevent
     http://
  2. Who, Why, What
     Netflix in the Cloud
     Cloud Performance Challenges
     Performance Architecture and Tools
  3. is now ~100% Cloud
     See http://
     Detailed SlideShare presentation: Netflix on Cloud
     http://
     We have 25 minutes, not half a day, to discuss everything!
  4. A Nice Problem To Have…
     http://-netflix-api.html
     37x Growth Jan 2010-Jan 2011
  5. Data Center
     We stopped building our own datacenters
     Capacity growth is accelerating, unpredictable
     Product launch spikes: iPhone, Wii, PS3, Xbox
  6. We want to use clouds, we don’t have time to build them
     Public cloud for agility and scale
     AWS because they are big enough to allocate thousands of instances per hour for us
  7. Netflix EC2 Instances per Account (summer 2010, production is up ~3x now…)
     “Many Thousands”
     Content Encoding
     Test and Production
     Log Analysis
     “Several Months”
  8. AWS Performance?
     Mostly good, better than expected overall
     •  The Good
        –  Large EC2 instance types (esp. the m2 range)
        –  Internal disk performance
        –  Network performance within and between Availability Zones
        –  Robustness and scalability of S3, SQS
     •  The Bad
        –  Elastic Load Balancer has too many limitations
        –  SimpleDB needs memcached front end, too many limitations at Terabyte scale
     •  The Ugly
        –  EBS performance is slow and inconsistent, we avoid it
  9. Learnings
     •  Datacenter oriented tools don’t work
        –  Ephemeral instances
        –  High rate of change
        –  Need too much hand-holding and manual setup
     •  Cloud Tools Don’t Scale for Enterprise
        –  Too many tools are “Startup” oriented
        –  Built our own tools for 1000’s of instances
        –  Drove vendors to be dynamic, scale, add APIs
     •  “fork-lifted” apps are fragile
        –  Too many datacenter oriented assumptions
        –  We re-wrote our code base!
        –  (We re-write it continuously anyway)
  10. Cloud Performance Challenges
      Model Driven Architecture
      Capacity Planning & Metrics
  11. Model Driven Architecture
      •  Datacenter Practices
         –  Lots of unique hand-tweaked systems
         –  Hard to enforce patterns
      •  Model Driven Cloud Architecture
         –  Perforce/Ivy/Hudson based builds for everything
         –  Every production instance is a pre-baked AMI
         –  Every application is managed by an Autoscaler
      No exceptions, every change is a new AMI
  12. Model Driven Implications
      •  Automated “Least Privilege” Security
         –  Tightly specified security groups
         –  Fine grain IAM keys to access AWS resources
         –  Performance tools security and integration
      •  Model Driven Performance Monitoring
         –  Hundreds of instances appear in a few minutes…
         –  Tools have to “garbage collect” dead instances
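The "garbage collect" point on slide 12 can be illustrated with a minimal sketch. It assumes each instance reports a heartbeat and that silence beyond a time-to-live means the instance is gone. The class name `InstanceRegistry`, its methods and the TTL value are hypothetical and are not taken from any tool named in the deck.

```python
import time


class InstanceRegistry:
    """Tracks monitored instances by their latest heartbeat (hypothetical helper)."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self.last_seen = {}  # instance id -> timestamp of latest heartbeat

    def heartbeat(self, instance_id, now=None):
        """Record that an instance reported in; `now` can be injected for testing."""
        self.last_seen[instance_id] = time.time() if now is None else now

    def collect_garbage(self, now=None):
        """Drop instances silent for longer than the TTL and return their ids."""
        now = time.time() if now is None else now
        dead = [instance_id for instance_id, seen in self.last_seen.items()
                if now - seen > self.ttl_seconds]
        for instance_id in dead:
            del self.last_seen[instance_id]
        return sorted(dead)


if __name__ == "__main__":
    registry = InstanceRegistry(ttl_seconds=300)
    registry.heartbeat("i-aaa", now=0)
    registry.heartbeat("i-bbb", now=400)
    print(registry.collect_garbage(now=500))
```

Relying on heartbeats pushed by the instances, rather than polling, matches the "outbound calls only" approach the deck describes later for monitoring agents.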
  13. Capacity Planning & Metrics
  14. What is Capacity Planning?
      •  We care about
         –  CPU, Memory, Network and Disk resources consumed
         –  Application response times
      •  We need to know
         –  how much of each resource we are using now
         –  how much will we use in the future
         –  how much headroom we have to handle higher loads
      •  We want to understand
         –  how headroom varies
         –  how it relates to response times and throughput
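The "need to know" items on slide 14 reduce to simple arithmetic once a usable maximum load is known. The sketch below assumes that maximum was measured separately (for example, the throughput at which response time is still acceptable) and that growth is linear. Both assumptions, and the function names, are illustrative rather than taken from the deck.

```python
def headroom_fraction(current_load, max_load):
    """Fraction of usable capacity still free at the current load."""
    return (max_load - current_load) / max_load


def days_until_exhausted(current_load, max_load, growth_per_day):
    """Days before load reaches the usable maximum, assuming linear growth."""
    if growth_per_day <= 0:
        return float("inf")  # flat or shrinking load never exhausts capacity
    return max(0.0, (max_load - current_load) / growth_per_day)


if __name__ == "__main__":
    # Example: 600 requests/s today, 1000 requests/s usable, growing 20 requests/s per day.
    print(headroom_fraction(600, 1000))
    print(days_until_exhausted(600, 1000, 20))
```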
  15. Capacity Planning in Clouds (a few things have changed…)
      •  Capacity is expensive
      •  Capacity takes time to buy and provision
      •  Capacity only increases, can’t be shrunk easily
      •  Capacity comes in big chunks, paid up front
      •  Planning errors can cause big problems
      •  Systems are clearly defined assets
      •  Systems can be instrumented in detail
      •  Depreciate assets over 3 years (reservations!)
  16. OK, so just give me the data!
      Throughput – not hard
      Response Time – mean+2xSD? %iles?
      Utilization….
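Slide 16 asks whether response time should be summarized as mean plus two standard deviations or as percentiles. The sketch below computes both for a skewed sample so the two can be compared. The nearest-rank percentile definition and the sample data are assumptions chosen for illustration; the deck does not prescribe either.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Return mean + 2 standard deviations alongside common percentiles."""
    mean = statistics.mean(latencies_ms)
    sd = statistics.pstdev(latencies_ms)
    return {
        "mean_plus_2sd": mean + 2 * sd,
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
    }


if __name__ == "__main__":
    # Skewed sample: 95 fast requests at 10 ms and 5 slow ones at 1000 ms.
    sample = [10] * 95 + [1000] * 5
    for name, value in summarize(sample).items():
        print(name, round(value, 1))
```

On a sample like this, mean plus two standard deviations lands at a latency that no request actually experienced, while the percentiles report values that occurred.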
  17. Utilization
      “Utilization is virtually useless as a metric”
      CMG 2006 Paper by Adrian Cockcroft
      Virtualization is a DOS attack on Capacity Planning…
  18. What would you say if you were asked:
      Q: That system is slow, how busy is it?
      A: I have no idea…
      A: The graph in this tool looks about 50%
      A: But the graph in this other tool is 65%
      A: Amazon CloudWatch says 82%
      A: Linux says us sy ni id wa st :(
      A: Why do you want to know?
      A: I’m sorry, you don’t understand your question….
  19. What's the problem with Utilization?
      •  CPU Capacity
         –  Varying capacity due to multi-tenancy
         –  Non-identical servers or CPUs (check /proc/cpuinfo)
         –  Non-linear capacity due to hyperthreading etc.
      •  Measurement Errors
         –  Monitoring tools that ignore “stolen time” (all of them)
         –  Mechanisms with built in bias (clock tick counting)
         –  Platform and release specific changes in metrics
      Every tool shows a different value for the same metric!
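To make the "stolen time" point on slide 19 concrete, here is a minimal sketch that derives CPU shares, including steal, from two snapshots of the aggregate `cpu` line in Linux `/proc/stat`. The field order follows the proc(5) manual page; the function names and the sample numbers are illustrative assumptions.

```python
# First eight counters on the aggregate "cpu" line of /proc/stat, per proc(5).
# Guest time is already counted inside user time, so it is not added again.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line into a dict of tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    if len(values) < len(FIELDS):
        raise ValueError("kernel does not report steal time on this line")
    return dict(zip(FIELDS, values))


def cpu_shares(before, after):
    """Percentage of elapsed ticks spent in each state between two snapshots."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("snapshots must be taken at different times")
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


if __name__ == "__main__":
    first = parse_cpu_line("cpu  100 0 50 800 10 0 0 40 0 0")
    second = parse_cpu_line("cpu  200 0 100 1400 20 0 0 280 0 0")
    shares = cpu_shares(first, second)
    print("steal %.1f%%, idle %.1f%%" % (shares["steal"], shares["idle"]))
```

A tool that leaves steal out of the denominator would report a higher busy percentage for the same interval, which is one way two tools can disagree about the same system.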
  20. Performance Tools Architecture
  21. Monitoring Issues
      •  Problem
         –  Too many tools, each with a good reason to exist
         –  Hard to get an integrated view of a problem
         –  Too much manual work building dashboards
         –  Tools are not discoverable, views are not filtered
      •  Solution
         –  Get vendors to add deep linking URLs and APIs
         –  Integration “portal” ties everything together
         –  Underlying dependency database
         –  Dynamic portal generation, relevant data, all tools
  22. Data Sources
      External Testing
         •  External URL availability and latency alerts and reports – Keynote
         •  Stress testing - SOASTA
      Request Trace Logging
         •  Netflix REST calls – Chukwa to DataOven with GUID transaction identifier
         •  Generic HTTP – AppDynamics service tier aggregation, end to end tracking
      Application logging
         •  Tracers and counters – log4j, tracer central, Chukwa to DataOven
         •  Trackid and Audit/Debug logging – DataOven, AppDynamics GUID cross reference
      JMX Metrics
         •  Application specific real time – Nimsoft, AppDynamics, Epic
         •  Service and SLA percentiles – Nimsoft, AppDynamics, Epic, logged to DataOven
      Tomcat and Apache logs
         •  Stdout logs – S3 – DataOven, Nimsoft alerting
         •  Standard format Access and Error logs – S3 – DataOven, Nimsoft Alerting
      JVM
         •  Garbage Collection – Nimsoft, AppDynamics
         •  Memory usage, call stacks, resource/call - AppDynamics
      Linux
         •  System CPU/Net/RAM/Disk metrics – AppDynamics, Epic, Nimsoft Alerting
         •  SNMP metrics – Epic, Network flows - Fastip
      AWS
         •  Load balancer traffic – Amazon CloudWatch, SimpleDB usage stats
         •  System configuration - CPU count/speed and RAM size, overall usage - AWS
  23. Integrated Dashboards
  24. Dashboards Architecture
      •  Integrated Dashboard View
         –  Single web page containing content from many tools
         –  Filtered to highlight most “interesting” data
      •  Relevance Controller
         –  Drill in, add and remove content interactively
         –  Given an application, alert or problem area, dynamically build a dashboard relevant to your role and needs
      •  Dependency and Incident Model
         –  Model Driven - Interrogates tools and AWS APIs
         –  Document store to capture dependency tree and states
  25. Dashboard Prototype (not everything is integrated yet)
  26. AppDynamics
      How to look deep inside your cloud applications
      •  Automatic Monitoring
         –  Base AMI bakes in all monitoring tools
         –  Outbound calls only – no discovery/polling issues
         –  Inactive instances removed after a few days
      •  Incident Alarms (deviation from baseline)
         –  Business Transaction latency and error rate
         –  Alarm thresholds discover their own baseline
         –  Email contains URL to Incident Workbench UI
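The idea of alarm thresholds that "discover their own baseline" (slide 26) can be sketched with an exponentially weighted moving average and variance. This is a generic technique chosen for illustration, not a description of how AppDynamics computes its baselines. The class name `BaselineAlarm` and the parameters `alpha`, `k` and `warmup` are hypothetical.

```python
import math


class BaselineAlarm:
    """Flags values that deviate from a baseline learned from the data itself."""

    def __init__(self, alpha=0.1, k=3.0, warmup=5):
        self.alpha = alpha    # weight given to the newest observation
        self.k = k            # alarm when deviation exceeds k standard deviations
        self.warmup = warmup  # observations to learn from before alarming
        self.count = 0
        self.mean = 0.0
        self.var = 0.0

    def observe(self, value):
        """Return True if `value` is anomalous, then fold it into the baseline."""
        self.count += 1
        if self.count == 1:
            self.mean = float(value)
            return False
        deviation = value - self.mean
        alarm = (self.count > self.warmup
                 and abs(deviation) > self.k * math.sqrt(self.var))
        # Exponentially weighted updates of the running mean and variance.
        self.mean += self.alpha * deviation
        self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        return alarm


if __name__ == "__main__":
    alarm = BaselineAlarm()
    latencies_ms = [100, 102, 98, 101, 99, 100, 101, 99, 500]
    print([alarm.observe(v) for v in latencies_ms])
```

In this sketch an anomalous value is still folded into the baseline, so a sustained shift eventually becomes the new normal. Whether that is desirable depends on the metric being watched.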
  27. Using AppDynamics (simple example from early 2010)
  28. Switch to Snapshot View
      Pick a slow call graph
  29. Interactions for this Snapshot
      Click to view call graph
  30. Point Finger and Assess Impact
      (an async S3 write was slow, no big deal)
  31. Summary
      •  Performance of AWS Systems isn’t an issue
      •  Broken datacenter tools and metrics is the issue!
      •  Integrating too many different tools
         –  They are not designed to be integrated
         –  Did I mention that I hate flash based user interfaces?
         –  We have “persuaded” vendors to add APIs
      •  If you can’t see deep inside your app, you’re :(
      Questions? Job Applications?
      @adrianco #netflixcloud #ccevent