Performance architecture for cloud connect

Slide deck that kicked off the performance day at Cloud Connect March 2011


Statistics

Total views: 8,365 (6,247 on SlideShare, 2,118 embedded)
Likes: 15 · Downloads: 0 · Comments: 1

Embeds (6, totalling 2,118 views):
  http://softwarestrategiesblog.com: 2,101
  http://www.linkedin.com: 6
  url_unknown: 4
  http://webcache.googleusercontent.com: 3
  https://www.linkedin.com: 3
  http://trunk.ly: 1

Upload details: Adobe PDF
Usage rights: © All Rights Reserved


Comment: "Thank you, Adrian. This was very informative and where possible reflects my (limited) experience."

Presentation Transcript

• Performance Architecture for Cloud
  March 7, 2011
  Adrian Cockcroft
  @adrianco #netflixcloud #ccevent
  http://www.linkedin.com/in/adriancockcroft
  acockcroft@netflix.com
• Who, Why, What
  Netflix in the Cloud
  Cloud Performance Challenges
  Performance Architecture and Tools
• Netflix.com is now ~100% Cloud
  See http://techblog.netflix.com
  Detailed SlideShare presentation: Netflix on Cloud
  http://slideshare.net/adrianco
  We have 25 minutes, not half a day, to discuss everything!
• A Nice Problem To Have…
  http://techblog.netflix.com/2011/02/redesigning-netflix-api.html
  37x growth, Jan 2010 to Jan 2011
• Data Center
  We stopped building our own datacenters
  Capacity growth is accelerating, unpredictable
  Product launch spikes: iPhone, Wii, PS3, XBox
• We want to use clouds, we don't have time to build them
  Public cloud for agility and scale
  AWS because they are big enough to allocate thousands of instances per hour for us
• Netflix EC2 Instances per Account (summer 2010, production is up ~3x now…)
  "Many Thousands": Content Encoding, Test and Production, Log Analysis
  "Several Months"
• AWS Performance? Mostly good, better than expected overall
  • The Good
    – Large EC2 instance types (esp. the m2 range)
    – Internal disk performance
    – Network performance within and between Availability Zones
    – Robustness and scalability of S3, SQS
  • The Bad
    – Elastic Load Balancer has too many limitations
    – SimpleDB needs a memcached front end, too many limitations at terabyte scale
  • The Ugly
    – EBS performance is slow and inconsistent, we avoid it
• Learnings
  • Datacenter-oriented tools don't work
    – Ephemeral instances
    – High rate of change
    – Need too much hand-holding and manual setup
  • Cloud tools don't scale for enterprise
    – Too many tools are "startup" oriented
    – Built our own tools for 1000s of instances
    – Drove vendors to be dynamic, scale, add APIs
  • "Fork-lifted" apps are fragile
    – Too many datacenter-oriented assumptions
    – We re-wrote our code base!
    – (We re-write it continuously anyway)
• Cloud Performance Challenges
  Model Driven Architecture
  Capacity Planning & Metrics
• Model Driven Architecture
  • Datacenter practices
    – Lots of unique hand-tweaked systems
    – Hard to enforce patterns
  • Model-driven cloud architecture
    – Perforce/Ivy/Hudson based builds for everything
    – Every production instance is a pre-baked AMI
    – Every application is managed by an autoscaler
  No exceptions: every change is a new AMI
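The "every change is a new AMI" discipline can be sketched as a tiny model. This is an illustrative sketch only, not Netflix's actual tooling: the `AMI` and `Autoscaler` classes and their fields are hypothetical, chosen to show that deployment means swapping an immutable image rather than mutating running instances.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AMI:
    """An immutable machine image: base OS plus one specific app build."""
    base: str
    app: str
    version: str

    @property
    def image_id(self) -> str:
        # Illustrative naming scheme, not a real AWS image ID format.
        return f"ami-{self.base}-{self.app}-{self.version}"


@dataclass
class Autoscaler:
    """Every application runs behind an autoscaler pinned to one AMI."""
    app: str
    ami: AMI
    min_size: int = 2
    max_size: int = 100

    def deploy(self, new_ami: AMI) -> "Autoscaler":
        # No in-place changes: a deploy replaces the whole image,
        # so every running instance is reproducible from the model.
        return Autoscaler(self.app, new_ami, self.min_size, self.max_size)


v1 = AMI("linux", "api", "1.0")
group = Autoscaler("api", v1)
group = group.deploy(AMI("linux", "api", "1.1"))  # every change is a new AMI
```

The payoff of this shape is that there is no configuration drift to discover: the AMI version on the autoscaler *is* the complete description of what is running.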
• Model Driven Implications
  • Automated "least privilege" security
    – Tightly specified security groups
    – Fine-grain IAM keys to access AWS resources
    – Performance tools security and integration
  • Model-driven performance monitoring
    – Hundreds of instances appear in a few minutes…
    – Tools have to "garbage collect" dead instances
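The "garbage collect dead instances" point above can be sketched as a TTL sweep over the monitoring inventory, a minimal illustration assuming the tool keeps a last-seen timestamp per instance (the function name and TTL value here are hypothetical):

```python
import time


def garbage_collect(last_seen: dict[str, float], now: float,
                    ttl_seconds: float = 600.0) -> dict[str, float]:
    """Drop instances that have not reported a metric within ttl_seconds.

    Cloud instances are ephemeral: hundreds appear within minutes and
    vanish just as fast, so the monitoring inventory has to expire
    entries itself instead of waiting for a manual decommission step.
    """
    return {inst: t for inst, t in last_seen.items() if now - t <= ttl_seconds}


now = time.time()
seen = {"i-alive": now - 30,     # reported 30 s ago: keep
        "i-dead": now - 3600}    # silent for an hour: collect
alive = garbage_collect(seen, now)
```

A datacenter tool would instead raise an alert for the silent host; in the cloud, silence past the TTL usually just means the autoscaler already replaced it.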
• Capacity Planning & Metrics
• What is Capacity Planning?
  • We care about
    – CPU, memory, network and disk resources consumed
    – Application response times
  • We need to know
    – how much of each resource we are using now
    – how much we will use in the future
    – how much headroom we have to handle higher loads
  • We want to understand
    – how headroom varies
    – how it relates to response times and throughput
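The headroom questions above reduce to simple arithmetic once you trust the inputs. A minimal sketch, with hypothetical function names and an assumed compounding monthly growth rate (at the deck's 37x/year pace, planning horizons are short):

```python
def headroom(current_throughput: float, peak_capacity: float) -> float:
    """Fraction of capacity still available before saturation."""
    return 1.0 - current_throughput / peak_capacity


def growth_runway(current: float, peak: float, monthly_growth: float) -> int:
    """Months until compounding demand first meets or exceeds capacity."""
    months = 0
    while current < peak and months < 120:  # cap the horizon at 10 years
        current *= 1.0 + monthly_growth
        months += 1
    return months


# At half capacity with 10% month-over-month growth, capacity is
# exhausted in well under a year.
h = headroom(50.0, 100.0)
runway = growth_runway(50.0, 100.0, 0.10)
```

The catch, which the following slides make, is that `peak_capacity` is exactly the number that virtualized, multi-tenant hardware makes hard to pin down.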
• Capacity Planning in Clouds (a few things have changed…)
  • Capacity is expensive
  • Capacity takes time to buy and provision
  • Capacity only increases, can't be shrunk easily
  • Capacity comes in big chunks, paid up front
  • Planning errors can cause big problems
  • Systems are clearly defined assets
  • Systems can be instrumented in detail
  • Depreciate assets over 3 years (reservations!)
• OK, so just give me the data!
  Throughput: not hard
  Response time: mean + 2×SD? Percentiles?
  Utilization…
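The "mean + 2×SD vs. percentiles" question matters because latency distributions are long-tailed, so the mean and standard deviation are dominated by a handful of outliers. A small illustration with made-up numbers (the nearest-rank percentile here is one of several common definitions):

```python
import statistics


def mean_plus_2sd(samples: list[float]) -> float:
    """Classic 'mean plus two sigma' summary; assumes a roughly normal shape."""
    return statistics.mean(samples) + 2 * statistics.stdev(samples)


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: value at or below which p% of samples fall."""
    s = sorted(samples)
    k = min(len(s) - 1, max(0, round(p / 100 * len(s)) - 1))
    return s[k]


# A typical skewed service latency sample: mostly fast, a few slow outliers.
latencies_ms = [10.0] * 90 + [50.0] * 8 + [500.0, 900.0]
```

On this sample the 95th percentile is 50 ms, a number a user would recognize, while mean + 2×SD lands above 200 ms, a value almost no request actually experienced. That gap is why percentiles are usually the better SLA metric.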
• Utilization
  "Utilization is virtually useless as a metric" (CMG 2006 paper by Adrian Cockcroft)
  Virtualization is a DOS attack on capacity planning…
• What would you say if you were asked:
  Q: That system is slow, how busy is it?
  A: I have no idea…
  A: The graph in this tool looks about 50%
  A: But the graph in this other tool is 65%
  A: Amazon CloudWatch says 82%
  A: Linux says us sy ni id wa st
  A: Why do you want to know?
  A: I'm sorry, you don't understand your question…
• What's the problem with utilization?
  • CPU capacity
    – Varying capacity due to multi-tenancy
    – Non-identical servers or CPUs (check /proc/cpuinfo)
    – Non-linear capacity due to hyperthreading etc.
  • Measurement errors
    – Monitoring tools that ignore "stolen time" (all of them)
    – Mechanisms with built-in bias (clock tick counting)
    – Platform- and release-specific changes in metrics
  Every tool shows a different value for the same metric!
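The "stolen time" problem is visible directly in `/proc/stat`, whose aggregate `cpu` line lists jiffies in the order user, nice, system, idle, iowait, irq, softirq, steal (per the proc(5) man page; later kernels append guest fields, which the slice below ignores). A sketch with an illustrative sample line, showing why steal must be reported rather than silently dropped:

```python
def cpu_breakdown(stat_line: str) -> dict[str, float]:
    """Split an aggregate 'cpu' line from /proc/stat into busy/idle/stolen.

    'steal' is time the hypervisor gave this virtual CPU to another
    tenant. A monitoring tool that ignores the steal column misreports
    how much capacity this instance really has, which is exactly the
    multi-tenancy measurement error called out above.
    """
    us, ni, sy, idle, wa, hi, si, st = [int(x) for x in stat_line.split()[1:9]]
    total = us + ni + sy + idle + wa + hi + si + st
    return {
        "busy_pct": 100.0 * (us + ni + sy + hi + si) / total,
        "idle_pct": 100.0 * (idle + wa) / total,
        "stolen_pct": 100.0 * st / total,
    }


# Illustrative numbers: 10% of this instance's time went to another tenant.
sample = "cpu 4000 100 1000 3000 500 100 300 1000"
pcts = cpu_breakdown(sample)
```

A tool that drops the steal column would divide the same busy time by a smaller total, so two tools reading the same kernel counters disagree, which is the slide's point.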
• Performance Tools Architecture
• Monitoring Issues
  • Problem
    – Too many tools, each with a good reason to exist
    – Hard to get an integrated view of a problem
    – Too much manual work building dashboards
    – Tools are not discoverable, views are not filtered
  • Solution
    – Get vendors to add deep-linking URLs and APIs
    – Integration "portal" ties everything together
    – Underlying dependency database
    – Dynamic portal generation, relevant data, all tools
• Data Sources
  External Testing
    – External URL availability and latency alerts and reports: Keynote
    – Stress testing: SOASTA
  Request Trace Logging
    – Netflix REST calls: Chukwa to DataOven with GUID transaction identifier
    – Generic HTTP: AppDynamics service tier aggregation, end-to-end tracking
  Application Logging
    – Tracers and counters: log4j, tracer central, Chukwa to DataOven
    – Trackid and audit/debug logging: DataOven, AppDynamics GUID cross-reference
  JMX Metrics
    – Application-specific real time: Nimsoft, AppDynamics, Epic
    – Service and SLA percentiles: Nimsoft, AppDynamics, Epic, logged to DataOven
  Tomcat and Apache Logs
    – Stdout logs: S3 to DataOven, Nimsoft alerting
    – Standard-format access and error logs: S3 to DataOven, Nimsoft alerting
  JVM
    – Garbage collection: Nimsoft, AppDynamics
    – Memory usage, call stacks, resource/call: AppDynamics
  Linux
    – System CPU/Net/RAM/Disk metrics: AppDynamics, Epic, Nimsoft alerting
    – SNMP metrics: Epic; network flows: Fastip
  AWS
    – Load balancer traffic: Amazon CloudWatch; SimpleDB usage stats
    – System configuration: CPU count/speed and RAM size, overall usage: AWS
• Integrated Dashboards
• Dashboards Architecture
  • Integrated dashboard view
    – Single web page containing content from many tools
    – Filtered to highlight the most "interesting" data
  • Relevance controller
    – Drill in, add and remove content interactively
    – Given an application, alert or problem area, dynamically build a dashboard relevant to your role and needs
  • Dependency and incident model
    – Model driven: interrogates tools and AWS APIs
    – Document store to capture dependency tree and states
• Dashboard Prototype (not everything is integrated yet)
• AppDynamics: how to look deep inside your cloud applications
  • Automatic monitoring
    – Base AMI bakes in all monitoring tools
    – Outbound calls only: no discovery/polling issues
    – Inactive instances removed after a few days
  • Incident alarms (deviation from baseline)
    – Business transaction latency and error rate
    – Alarm thresholds discover their own baseline
    – Email contains URL to Incident Workbench UI
• Using AppDynamics (simple example from early 2010)
• Switch to Snapshot View: pick a slow call graph
• Interactions for this Snapshot: click to view call graph
• Point Finger and Assess Impact (an async S3 write was slow, no big deal)
• Summary
  • Performance of AWS systems isn't an issue
  • Broken datacenter tools and metrics are the issue!
  • Integrating too many different tools
    – They are not designed to be integrated
    – Did I mention that I hate Flash-based user interfaces?
    – We have "persuaded" vendors to add APIs
  • If you can't see deep inside your app, you're ☹
  Questions? Job Applications?
  @adrianco #netflixcloud #ccevent