PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Applications Using PPA, by Hui Huang, Zhaoqiang Zheng and Lihua Zhang

Presentation PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Applications Using PPA, by Hui Huang, Zhaoqiang Zheng and Lihua Zhang, at the AMD Developer Summit (APU13), November 11-13, 2013


  1. MEASURING AND OPTIMIZING PERFORMANCE OF CLUSTER AND PRIVATE CLOUD APPLICATIONS BY USING PPA
  2. MULTICOREWARE INC: LIHUA.ZHANG, HUI.HUANG, ANDY.ZHENG
  3. Introduction to MCW PPA™ For Cluster
     A tracing tool that targets distributed systems.
     • Collects instrumented data and hardware measurements in a distributed fashion within a tracing infrastructure.
     • Provides visualizations with intuitive graphs/Gantt charts and generates statistical reports intended for identifying critical paths.
     • Does offline analysis that aids in understanding the target system's behavior and reasoning about performance issues.
     • PPA product series
         ‒ PPA For Cluster
         ‒ PPA Workstation Edition
         ‒ PPA For Android
     3 | PRESENTATION TITLE | DECEMBER 4, 2013 | CONFIDENTIAL
  4. Main Features
     • Low overhead
         ‒ Has negligible performance impact on running applications by relying on the PPA runtime library. This is very useful for highly optimized, performance-sensitive cases.
     • Instrumentation at the application level
         ‒ The PPA runtime library provides APIs to instrument code. The hardware measurement part is transparent to developers, and the PPA instrumentation can easily be cleaned up by turning on a disable option.
         ‒ Auto-instrumentation of binaries available soon.
     • Scalability
         ‒ The tool can be extended to profile clusters of various scales (currently up to 4000 nodes) and services (e.g. Hadoop). This benefits from PPA's distributed data repositories, big-data processing, buffered views of visualizations, etc.
         ‒ PPA Profiler can be extended to support HW vendor-specific features.
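The slide above describes application-level instrumentation that can be switched off with a disable option. The deck does not show the PPA API itself, so the following is only a minimal sketch of the general idea in Python; the names `ENABLED`, `trace_events`, and `cpu_event` are hypothetical and are not taken from `ppaAPI.h` or `JPPA.jar`.

```python
import time
from contextlib import contextmanager

# Hypothetical names for illustration only; none of these come from the PPA SDK.
ENABLED = True      # global "disable option": set to False to skip all tracing
trace_events = []   # collected (name, start_ns, end_ns) tuples


@contextmanager
def cpu_event(name, enabled=None):
    """Time the enclosed block and record it; record nothing when disabled."""
    active = ENABLED if enabled is None else enabled
    if not active:
        yield  # disabled: no timestamps are taken and nothing is stored
        return
    start = time.perf_counter_ns()
    try:
        yield
    finally:
        # Record the event even if the instrumented block raises.
        trace_events.append((name, start, time.perf_counter_ns()))


def work():
    return sum(range(1000))


with cpu_event("work"):
    result = work()

# The same call site with tracing disabled leaves no trace record behind.
with cpu_event("work_disabled", enabled=False):
    work()
```

Keeping the disabled path free of timestamp calls is one way to approximate the "almost no cost if no capture is enabled" property the slides claim; how PPA achieves it is not stated in the deck.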
  5. The Highlights
     • Profiler and performance analyzer
         ‒ Low overhead (almost no cost if no profiling capture is enabled)
         ‒ CPU & GPU activity traces
         ‒ Hardware utilization measurement
         ‒ HW vendor-specific support
         ‒ Time-based views and statistical analysis/reports
         ‒ Multi-core profiling per process/thread at the source-code level
         ‒ Good data organization in intuitive colour schemes
     • Big data support
         ‒ Storage
         ‒ Smooth visualization
     • System-wide critical path identification
         ‒ Correlate hardware utilization and CPU events on the same timeline
         ‒ Cluster-wide global clock synchronization
         ‒ Multi-views for sessions from different nodes on the same timeline
         ‒ Runtime monitors
     • Customizable for specific applications, e.g. Hadoop
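Placing sessions from different nodes on one timeline requires mapping each node's local timestamps onto a common clock. The deck does not say how PPA implements its cluster-wide clock synchronization; the sketch below substitutes the well-known NTP-style offset estimate purely as an illustration, and the function names are assumptions.

```python
def estimate_offset(t0, t1, t2, t3):
    """NTP-style estimate of the master clock's offset relative to a node.

    t0: node sends request      (node clock)
    t1: master receives request (master clock)
    t2: master sends reply      (master clock)
    t3: node receives reply     (node clock)
    Returns (offset, round_trip_delay) in the same time unit as the inputs.
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2
    delay = (t3 - t0) - (t2 - t1)
    return offset, delay


def to_global_time(local_ts, offset):
    """Map a node-local timestamp onto the master's timeline."""
    return local_ts + offset


# Example: master clock runs 100 units ahead, one-way latency is 5 units,
# and the master spends 2 units handling the request.
offset, delay = estimate_offset(t0=1000, t1=1105, t2=1107, t3=1012)
```

The estimate is exact only when the network latency is the same in both directions, which is why such schemes usually repeat the exchange and keep the sample with the smallest round-trip delay.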
  6. Developer Library Overview
     • C/C++ SDK
         ‒ Already used in numerous OpenCL™ applications
     • Java Support
         ‒ Java bindings for OpenCL™ applications
     • Thread-safe
     • Low overhead if no capture
     • Transparent for OpenCL instrumentation
         ‒ Timing OpenCL APIs
         ‒ Timing kernels & data transfers: start/submit/queue/complete
         ‒ Visualize construction of dependence graph between kernels & data transfers
         ‒ Exclusive sub-kernel support for AMD GFX cards
     • C/C++: Provides a friendly interface (ppaAPI.h) for the C/C++ developer.
     • JAVA: Provides a friendly interface (JPPA.jar) for the Java developer.
  7. System Overview
     • Distributed repositories for trace data
     • Distributed post-processing to minimize overhead
     • Powerful visualization engine
     • Scales to cluster systems of any size
     [Architecture diagram. Layers: Presentation layer, UI Logic layer, Network layer, Profiler Logic layer, Data layer. Component labels: Graphics Rendering; Raw Data Post Processing; Communication Framework; Processed Data Repository; Data Transfer; Profiler Control (Start/Stop etc.); Data collecting by PPA Profiler; Data serialize for Presentation; Fault-tolerant; Synchronization and heartbeat etc.; Other profiler logic; Raw Data Repository]
  8. Getting Started
     • Install PPA Clients and PPA Server on the target platforms
         ‒ Deploy PPA Clients by scripts
         ‒ CLI supported for capture
         ‒ Generally the PPA Server runs on the master node
     • Set up capture options
         ‒ Node IP, communication port…
         ‒ Optionally select nodes to profile
         ‒ Optionally enable CPU Event filters
         ‒ Optionally enable CPU Event merge
         ‒ Hardware measurement is enabled by default
     • Collect data and analysis reports
     • Operate views
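The capture options listed above can be pictured as a small configuration record. The deck does not show PPA's actual option names or file format, so every field name below is hypothetical, as are the example address and port; only the defaults (all nodes profiled unless a subset is selected, hardware measurement on) follow the slide.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CaptureOptions:
    """Illustrative capture settings; field names are assumptions, not PPA's."""
    server_ip: str                       # node IP of the PPA Server
    port: int                            # communication port
    nodes: Optional[List[str]] = None    # None means "profile every node"
    event_filters: List[str] = field(default_factory=list)  # empty: no filters
    merge_events: bool = False           # CPU Event merge is optional
    hardware_measurement: bool = True    # enabled by default, per the slide

    def should_profile(self, node: str) -> bool:
        """Return True if the given node is part of this capture."""
        return self.nodes is None or node in self.nodes


# Example values are made up for illustration.
opts = CaptureOptions(server_ip="10.0.0.1", port=9000, nodes=["10.0.0.2"])
```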
  9. Summary View
     • Helps find problematic nodes or unbalanced loads.
     • Shows the differences between runs.
     [Screenshot callouts: Multistage Table; Bar Charts]
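Finding problematic nodes or unbalanced loads from a summary table amounts to comparing each node against the rest of the cluster. The deck does not describe how the Summary View computes this, so the following is a minimal sketch under an assumed rule: flag any node whose total time deviates from the cluster mean by more than a tolerance. The function name, the 25% tolerance, and the sample numbers are all illustrative.

```python
from statistics import mean


def find_outlier_nodes(node_seconds, tolerance=0.25):
    """Return names of nodes whose time deviates from the mean by > tolerance."""
    avg = mean(node_seconds.values())
    return sorted(
        node for node, seconds in node_seconds.items()
        if abs(seconds - avg) > tolerance * avg
    )


# Made-up per-node totals, in seconds.
totals = {"node-01": 100, "node-02": 104, "node-03": 98, "node-04": 170}
outliers = find_outlier_nodes(totals)
```

With these numbers the mean is 118 seconds, so only `node-04` falls outside the 25% band.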
  10. The Sharp Utility: Timeline View
     • Correlate CPU Events to HW performance in analysis
     [Screenshot callouts: Monitoring application's behaviour; Session and its node list; Monitoring hardware behavior; Zoom in/out from hour to ns resolutions]
  11. Profiling Data
     • CPU Events Level
         ‒ Thread
         ‒ Name
         ‒ Core mitigation
         ‒ Timing
     • OpenCL traces
     • Hardware counters
         ‒ % CPU Usage
         ‒ Memory Usage
         ‒ Bytes read/written on disk
         ‒ Bytes in/out on the network
         ‒ Cache hit/miss
     • Statistics
         ‒ Process/Thread involved
         ‒ # of total CPU Events
         ‒ # of the same CPU Events
         ‒ Min/Max/Average for each
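The statistics listed above (total event count, count per event name, and min/max/average for each) can be derived directly from timed events. The sketch below assumes events are stored as `(name, start_ns, end_ns)` tuples, which is an illustrative layout and not PPA's trace format.

```python
from collections import defaultdict


def summarize(events):
    """Group event durations by name and compute count/min/max/avg for each."""
    durations = defaultdict(list)
    for name, start_ns, end_ns in events:
        durations[name].append(end_ns - start_ns)
    return {
        name: {
            "count": len(d),
            "min": min(d),
            "max": max(d),
            "avg": sum(d) / len(d),
        }
        for name, d in durations.items()
    }


# Made-up events for illustration.
events = [("map", 0, 40), ("map", 50, 110), ("reduce", 120, 150)]
stats = summarize(events)
total_events = sum(s["count"] for s in stats.values())
```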
  12. Timeline View for CPU Events
     • Process-thread-event data
         ‒ Identify the problematic process/thread/event
         ‒ Show dependencies
         ‒ Show parent & child relationships
         ‒ Frames analyzer for frame-based programs
     [Screenshot callouts: Expand process; Expand thread]
  13. Timeline View for HW measurement
     • Aggregate performance data
     • Per-core data
     [Screenshot callouts: When is the critical throughput on disk? Abnormal load on the network? When is the CPU usage very low or high?]
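Questions such as "when is the CPU usage very low or high?" can also be answered programmatically from sampled hardware counters. The following is a minimal sketch, assuming samples arrive as time-ordered `(timestamp, value)` pairs; the function name, the thresholds, and the sample data are illustrative and not part of PPA.

```python
def threshold_intervals(samples, predicate):
    """Return (first_ts, last_ts) for each run of consecutive matching samples."""
    intervals = []
    start = None
    last = None
    for ts, value in samples:
        if predicate(value):
            if start is None:
                start = ts      # a new run begins here
            last = ts
        elif start is not None:
            intervals.append((start, last))  # the run ended on the previous sample
            start = None
    if start is not None:
        intervals.append((start, last))      # close a run that reaches the end
    return intervals


# Made-up CPU usage samples: (second, percent).
cpu = [(0, 12.0), (1, 95.0), (2, 97.0), (3, 40.0), (4, 3.0), (5, 2.0)]
high = threshold_intervals(cpu, lambda pct: pct >= 90.0)
low = threshold_intervals(cpu, lambda pct: pct <= 5.0)
```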
  14. Where Multi-views Help Optimization
     • Identify node's abnormal behavior
     • Difference/relations between nodes
     • Job scheduler matters
  15. Hadoop with PPA on AWS as Demo
     • Overview of the tracing infrastructure
  16. Setup AWS EC2 instance
     • 16 Hadoop nodes (dual-core nodes with 7.5 GB memory)
     • 4 GB Hadoop Terasort workload
     • > 1.2 GB PPA trace per node
  17. Run Hadoop jobs
     • Start the capture
     • Jobs are done by map & reduce
  18. Remote control by VNC viewer
     • Intended for multiple users on AWS
     • Experience and operate PPA from different connection points
  19. CONTACT US:
     CURTIS@MULTICOREWAREINC.COM
     LIHUA@MULTICOREWAREINC.COM
     MANU@MULTICOREWAREINC.COM
