CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distributed Platforms, by Max Grossman
 

Presentation CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distributed Platforms, by Max Grossman at the AMD Developer Summit (APU13) November 11-13, 2013.

Presentation Transcript

    • CHARACTERIZING APU PERFORMANCE IN HADOOPCL ON HETEROGENEOUS DISTRIBUTED PLATFORMS
      MAX GROSSMAN, MAURICIO BRETERNITZ, AND VIVEK SARKAR
      RICE UNIVERSITY & AMD
    • MOTIVATION
        ▪ Cloud offers elasticity, lowered startup costs, and a unified platform for all
        ▪ Users generally see worse and less predictable performance
            ‒ Noisy neighbor
        ▪ Economies of scale => cloud is here to stay
        "I don't care where my code runs, as long as it finishes… someday" – Bob the Cloud User
        2 | PRESENTATION TITLE | NOVEMBER 21, 2013 | CONFIDENTIAL
    • STATE-OF-THE-ART
        ▪ Hadoop
            ‒ Java programming language
            ‒ JDK libraries
            ‒ Arbitrary data types
            ‒ Reliability
            ‒ Simple MapReduce distributed programming model
        ▪ Abstractions built on Hadoop
            ‒ H2O from 0xdata
            ‒ Mahout machine learning framework
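The MapReduce model listed above can be made concrete with a minimal single-process word count in plain Java. This sketch does not use the Hadoop API. The map phase, the shuffle (grouping by key), and the reduce phase are simulated with in-memory collections, and the class name `MiniMapReduce` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-JVM illustration of the MapReduce model (word count).
// Real Hadoop distributes these phases across nodes; here they run in one process.
final class MiniMapReduce {

    // Map phase: each input line emits (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // Reduce phase: all values for one key are combined into a single count.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Driver: map every line, group ("shuffle") by key, then reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("the cloud is here", "the cloud is noisy")));
        // {cloud=2, here=1, is=2, noisy=1, the=2}
    }
}
```

The programmer writes only `map` and `reduce`. Hadoop supplies the grouping, distribution, and fault tolerance that the `run` driver stands in for here.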
    • PROBLEMS
        1. Poor computational performance
            ‒ JVM execution: short-lived tasks imply poor JIT optimization and a high startup cost for creating child processes
        2. Poor I/O performance
            ‒ Serialization and deserialization of arbitrary data types
        3. Manual tweaking of intertwined tunables
            ‒ In an unstable cloud environment, you never have it right
        4. Scheduling execution & communication with a holistic view of the platform
        [Figure: A small sampling of Hadoop tunables…]
    • A POTENTIAL SOLUTION
        ▪ OpenCL
            ‒ SIMD programming model
            ‒ Multi-architecture and multi-vendor support
            ‒ APIs for launching compute and copy tasks
        ▪ An expert programmer could:
            1. Translate all application code to OpenCL kernels
            2. Compile the OpenCL kernels and API calls into a native library
            3. Call the native library from Java via JNI
            4. Spend a lot of time debugging performance and correctness
        ▪ Still not good enough!
        [Diagram: the host application launches work on the device with clEnqueueNDRangeKernel()]
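OpenCL's data-parallel model can be pictured as one kernel function that is invoked once for every index of a global range. On a real device, that launch is what `clEnqueueNDRangeKernel` performs. The Java analogy below is only a sketch: the `launch` helper and class name are hypothetical, and the "work-items" run on JVM threads through a parallel stream, not on an OpenCL device.

```java
import java.util.function.IntConsumer;
import java.util.stream.IntStream;

// Hypothetical Java analogy for a 1-D OpenCL NDRange launch: the "kernel" is a
// function of its global work-item id, and the launcher invokes it for every id
// in [0, globalSize). This executes on JVM threads, not on an OpenCL device.
final class NDRangeSketch {

    // Stand-in for enqueuing a kernel over a 1-D global range.
    static void launch(int globalSize, IntConsumer kernel) {
        IntStream.range(0, globalSize).parallel().forEach(kernel);
    }

    // Element-wise vector addition in kernel style: each work-item handles exactly
    // one index, and there is no loop over the data inside the kernel body.
    static float[] vectorAdd(float[] a, float[] b) {
        float[] c = new float[a.length];
        launch(a.length, gid -> c[gid] = a[gid] + b[gid]);
        return c;
    }

    public static void main(String[] args) {
        float[] c = vectorAdd(new float[] {1f, 2f, 3f}, new float[] {10f, 20f, 30f});
        System.out.println(java.util.Arrays.toString(c)); // [11.0, 22.0, 33.0]
    }
}
```

The steps the slide lists (kernel translation, native compilation, JNI glue) are exactly what this toy hides. That hidden effort is the cost the slide calls "still not good enough".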
    • [Diagram: Hadoop (reliability, distributed platform), APARAPI (bytecode to OpenCL kernels), and OpenCL (multi-architecture execution in native threads)]
        ▪ Hardware-aware platform manager
        ▪ Machine-learning, multi-device scheduler based on device occupancy and past kernel performance
        ▪ Architecture-aware optimizing compiler
        ▪ Hadoop-like API
    • HADOOPCL ARCHITECTURE
        class PiMapper extends DoubleDoubleBoolIntHadoopCLMapper {
            public void map(double x, double y) {
                if (x * x + y * y > 0.25) {
                    write(false, 1);
                } else {
                    write(true, 1);
                }
            }
        }

        job.waitForCompletion(true);

        [Diagram: javac compiles the mapper to a .class file]
        ▪ The HadoopCL programming model supports
            ‒ Java syntax
            ‒ MapReduce abstractions
            ‒ Dynamic memory allocation
            ‒ A variety of data types (primitives, sparse vectors, tuples, etc.), and can be extended to more
            ‒ Constant globals accessible from anywhere
        ▪ HadoopCL does not support
            ‒ Arbitrary inputs, outputs
            ‒ Massive data elements (e.g. sparse vectors larger than device memory)
            ‒ Object references
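`DoubleDoubleBoolIntHadoopCLMapper` and its `write()` come from HadoopCL, so the mapper above cannot run on its own. The plain-Java sketch below mimics its logic so that the Monte Carlo estimate can be checked. Its `write()` is a hypothetical stand-in that tallies counts the way a summing reducer would.

The sketch also assumes, since the slide does not say, that input points are drawn uniformly from [-0.5, 0.5) in each coordinate. Under that assumption, the radius-0.5 circle tested by `x * x + y * y > 0.25` covers π/4 of the unit square.

```java
import java.util.Random;

// Plain-Java mimic of the PiMapper logic. write() is a hypothetical stand-in for
// HadoopCL's output call that tallies (inside?, 1) pairs like a summing reducer.
final class PiSketch {
    private long inside = 0;
    private long total = 0;

    // Stand-in for write(boolean, int): accumulate counts per key.
    private void write(boolean isInside, int count) {
        if (isInside) {
            inside += count;
        }
        total += count;
    }

    // Same test as PiMapper.map: points farther than 0.5 from the origin are outside.
    void map(double x, double y) {
        if (x * x + y * y > 0.25) {
            write(false, 1);
        } else {
            write(true, 1);
        }
    }

    // Area(circle) / Area(square) = (pi * 0.25) / 1, so pi is about 4 * inside / total.
    double estimatePi() {
        return 4.0 * inside / total;
    }

    static double run(int samples, long seed) {
        Random rng = new Random(seed);
        PiSketch mapper = new PiSketch();
        for (int i = 0; i < samples; i++) {
            // Assumed input distribution: uniform over [-0.5, 0.5) in each coordinate.
            mapper.map(rng.nextDouble() - 0.5, rng.nextDouble() - 0.5);
        }
        return mapper.estimatePi();
    }

    public static void main(String[] args) {
        System.out.println(run(1_000_000, 42L));
    }
}
```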
    • HADOOPCL ARCHITECTURE
        $ hadoop jar Pi.jar input output
        [Diagram: a NameNode + JobTracker and several DataNodes. On each Hadoop DataNode, a TaskTracker with the HadoopCL ML Device Scheduler manages multiple HadoopCL Child processes, each running a Map or Reduce Task.]
    • HADOOPCL ARCHITECTURE
        ▪ Each Child JVM encloses a data-driven pipeline of communication and computation tasks
            ‒ Data is buffered in chunks for processing on the OpenCL device
        ▪ HadoopCL explicitly manages buffers to prevent large GC overheads
        ▪ The Kernel Executor handles
            ‒ Auto-generation and optimization of OpenCL kernels from JVM bytecode
            ‒ Transfer of inputs to the device and outputs back from it
            ‒ Asynchronous launch of OpenCL kernels
        [Diagram: a HadoopCL Child running a Map or Reduce Task, with an Input Buffer Manager, Input Buffer Queue, Kernel Executor (Launch, Retry), OpenCL Device, Output Buffer Queue, Output Collector, and Output Buffer Manager; buffers are released back for reuse]
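The explicit buffer management described above can be illustrated with a fixed pool of preallocated primitive arrays. These buffers circulate between a producer thread and a consumer thread, so steady-state processing allocates no new buffers for the garbage collector to reclaim.

This is a hedged sketch, not HadoopCL's implementation. The class names, the pool size, and the summation that stands in for a kernel launch are all assumptions, and no OpenCL device is involved.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch of a data-driven pipeline over recycled primitive buffers.
// A fixed pool of buffers is allocated once; the producer fills them with input
// records, an executor thread consumes them, and each buffer is released back to
// the pool, so steady-state processing creates no new buffers for the GC.
final class BufferPipelineSketch {

    static final class Buffer {
        final double[] data;
        int count;     // number of valid elements in data
        boolean last;  // true for the final chunk of the input
        Buffer(int capacity) { data = new double[capacity]; }
    }

    // Sums the input by streaming it through the pipeline; poolSize and chunk must be >= 1.
    static double run(double[] input, int poolSize, int chunk) {
        try {
            return runPipeline(input, poolSize, chunk);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("pipeline interrupted", e);
        }
    }

    private static double runPipeline(double[] input, int poolSize, int chunk)
            throws InterruptedException {
        BlockingQueue<Buffer> free = new ArrayBlockingQueue<>(poolSize);   // buffer manager
        BlockingQueue<Buffer> filled = new ArrayBlockingQueue<>(poolSize); // input buffer queue
        for (int i = 0; i < poolSize; i++) {
            free.put(new Buffer(chunk)); // the only buffer allocations in the pipeline
        }

        double[] result = new double[1];
        Thread executor = new Thread(() -> {
            try {
                boolean done = false;
                while (!done) {
                    Buffer b = filled.take();
                    for (int i = 0; i < b.count; i++) {
                        result[0] += b.data[i]; // stand-in for running a kernel on the chunk
                    }
                    done = b.last;
                    free.put(b); // release the buffer for reuse
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        executor.start();

        // Producer: copy input records into recycled buffers, one chunk at a time.
        int pos = 0;
        do {
            Buffer b = free.take();
            b.count = Math.min(chunk, input.length - pos);
            System.arraycopy(input, pos, b.data, 0, b.count);
            pos += b.count;
            b.last = pos >= input.length;
            filled.put(b);
        } while (pos < input.length);

        executor.join();
        return result[0];
    }
}
```

The blocking queues provide both the back-pressure and the memory visibility between threads. A producer that runs ahead simply waits for a buffer to be released.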
    • TOPICS IN HADOOPCL
        ▪ Extending APARAPI with architecture- and data-aware compiler optimizations
            1. A number of HadoopCL-specific functions are auto-generated from APARAPI at runtime
            2. When GPU execution is detected and a vector data type is in use, the HadoopCL runtime auto-strides input vectors before copying them to the device
                ‒ APARAPI must then emit strided code to match the data layout, which fails in certain cases

        Strided:
        double MahoutKMeansMapper__dot(...){
            double agg = 0.0;
            for (int i = 0; i < length1; i++){
                int currentIndex = index1[(i) * this->nPairs];
                int j = 0;
                for (; j<length2 && currentIndex!=index2[j]; j++) ;
                if (j != length2)
                    agg = agg + (val1[(i) * this->nPairs] * val2[j]);
            }
            return(agg);
        }

        Unstrided:
        double MahoutKMeansMapper__dot(...){
            double agg = 0.0;
            for (int i = 0; i < length1; i++){
                int currentIndex = index1[i];
                int j = 0;
                for (; j<length2 && currentIndex!=index2[j]; j++) ;
                if (j!=length2)
                    agg = agg + (val1[i] * val2[j]);
            }
            return(agg);
        }
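The auto-striding transformation can be sketched in plain Java. A batch of sparse vectors stored one after another is rewritten so that entry `i` of vector `v` lives at `i * nVectors + v`. This matches the `index1[(i) * this->nPairs]` access in the strided kernel above, presumably with `index1` already offset to the work-item's own vector, which the slide does not show. Adjacent GPU work-items then read adjacent addresses.

The method names, the zero padding of shorter vectors, and treating the stride as the number of vectors in the batch are assumptions made for illustration.

```java
// Hypothetical sketch of auto-striding a batch of sparse vectors before a device copy.
// Unstrided layout: each vector's entries are contiguous. Strided layout: entry i of
// vector v is stored at i * nVectors + v, so when work-item v reads entry i it touches
// the address next to the one work-item v + 1 reads (coalesced access on a GPU).
final class StrideSketch {

    // Interleave per-vector index arrays; slots past a vector's length stay 0 (padding).
    static int[] strideIndices(int[][] indices) {
        int n = indices.length;
        int max = 0;
        for (int[] vec : indices) {
            max = Math.max(max, vec.length);
        }
        int[] out = new int[max * n];
        for (int v = 0; v < n; v++) {
            for (int i = 0; i < indices[v].length; i++) {
                out[i * n + v] = indices[v][i];
            }
        }
        return out;
    }

    // Same interleaving for the value arrays (assumed to match the index arrays in shape).
    static double[] strideValues(double[][] values) {
        int n = values.length;
        int max = 0;
        for (double[] vec : values) {
            max = Math.max(max, vec.length);
        }
        double[] out = new double[max * n];
        for (int v = 0; v < n; v++) {
            for (int i = 0; i < values[v].length; i++) {
                out[i * n + v] = values[v][i];
            }
        }
        return out;
    }

    // Sparse dot product over an unstrided vector, mirroring the unstrided kernel above.
    static double dot(int[] index1, double[] val1, int length1,
                      int[] index2, double[] val2, int length2) {
        double agg = 0.0;
        for (int i = 0; i < length1; i++) {
            int currentIndex = index1[i];
            int j = 0;
            while (j < length2 && currentIndex != index2[j]) {
                j++;
            }
            if (j != length2) {
                agg += val1[i] * val2[j];
            }
        }
        return agg;
    }

    // Same computation for vector v read out of the strided arrays (the strided kernel above).
    static double dotStrided(int[] sIndex, double[] sVal, int v, int nVectors, int length1,
                             int[] index2, double[] val2, int length2) {
        double agg = 0.0;
        for (int i = 0; i < length1; i++) {
            int currentIndex = sIndex[i * nVectors + v];
            int j = 0;
            while (j < length2 && currentIndex != index2[j]) {
                j++;
            }
            if (j != length2) {
                agg += sVal[i * nVectors + v] * val2[j];
            }
        }
        return agg;
    }
}
```

For example, with vectors `{0, 3, 5}` and `{1, 3}`, the strided index array is `[0, 1, 3, 3, 5, 0]`. Both dot-product variants give the same result for each vector. Only the memory layout, and therefore the kernel's index arithmetic, changes.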
    • TOPICS IN HADOOPCL
        ▪ Enabling OpenCL dynamic memory allocation through restart-able kernels
            ‒ Note: there are no side effects of mappers or reducers until they commit (i.e. write())
        [Diagram: the OpenCL device heap alongside the free, nWrites, nInputs, and writeOffsetLookup buffers]

        Mapper.java:
        public void map(int key, double val) {
            int[] outputVec = new int[10];
            ...
            write(key, outputVec);
        }

        Mapper.cl:
        __kernel void map(int key, double val) {
            int oldOffset = atomic_add(free, 10);
            if (oldOffset + 10 >= heapSize) {
                nWrites[inputIndex] = -1;
                return;
            }
            ...
            writeOffsetLookup[inputIndex] = oldOffset;
            nWrites[inputIndex] = nWrites[inputIndex] + 1;
        }
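The restart-able kernel scheme can be simulated on the host:

1. Each input reserves space in a shared heap by bumping an atomic pointer.
2. An input whose reservation overflows marks itself with -1 and returns without side effects.
3. The host re-launches only those inputs, against an emptied heap.

This Java sketch follows the slide's pseudocode for the 10-element allocation, the atomic add, and the -1 marker. The host-side draining step, the pass loop, and all names are assumptions.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;

// Hypothetical host-side simulation of restart-able kernels over a bump-allocated heap.
// Each "work-item" reserves ALLOC ints with an atomic add. On overflow it sets
// nWrites[inputIndex] = -1 and returns without writing anything, so the host can
// simply re-run that input in a later pass. Nothing here runs on an OpenCL device.
final class RestartableKernelSketch {
    static final int ALLOC = 10; // per-input allocation, matching new int[10] on the slide

    // Re-launches until every input has committed; returns the number of passes used.
    static int processAll(int nInputs, int heapSize) {
        if (heapSize < ALLOC) {
            throw new IllegalArgumentException("heap cannot fit even one allocation");
        }
        int[] nWrites = new int[nInputs];
        int[] writeOffsetLookup = new int[nInputs];
        boolean[] done = new boolean[nInputs];
        int passes = 0;
        int[] pending = IntStream.range(0, nInputs).toArray();
        while (pending.length > 0) {
            passes++;
            int[] heap = new int[heapSize];            // heap emptied by the host before each pass
            AtomicInteger free = new AtomicInteger(0); // bump pointer shared by all work-items
            // One simulated kernel launch over the inputs that have not yet committed.
            Arrays.stream(pending).parallel().forEach(inputIndex -> {
                int oldOffset = free.getAndAdd(ALLOC);
                if (oldOffset + ALLOC > heapSize) {     // exact-fit check; the slide uses >=
                    nWrites[inputIndex] = -1;           // request a restart, no side effects
                    return;
                }
                for (int k = 0; k < ALLOC; k++) {
                    heap[oldOffset + k] = inputIndex;   // fill the mapper's output vector
                }
                writeOffsetLookup[inputIndex] = oldOffset;
                nWrites[inputIndex] = 1;                // commit a single write()
            });
            // Host: inputs whose output landed in the heap are finished; the rest retry.
            for (int i : pending) {
                if (nWrites[i] == 1 && heap[writeOffsetLookup[i]] == i) {
                    done[i] = true;
                }
            }
            pending = IntStream.range(0, nInputs).filter(i -> !done[i]).toArray();
        }
        return passes;
    }
}
```

With a 100-int heap, exactly 10 inputs fit per pass, so 25 inputs need 3 launches. Restarts are safe only because, as the slide notes, a mapper has no side effects until it commits.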
    • TOPICS IN HADOOPCL
        ▪ Auto-scheduling OpenCL kernels across execution platforms through machine learning
            ‒ The HadoopCL TaskTracker is responsible for
                1. Assigning each Task an execution platform (GPU, CPU, or JVM)
                2. Recording the execution time for each task, along with the kernel executed and the average device occupancy during that task's execution
        ▪ Device assignment is based on programmer hints and/or recorded data from previous runs
            ‒ Data is recorded in files to be used across Jobs
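A much-simplified stand-in for the device scheduler is sketched below. It records execution times per (kernel, occupancy bucket, device) and picks the device with the lowest average recorded time under similar load. When nothing has been recorded, it falls back to the programmer's hint.

The slides do not describe HadoopCL's actual learning model, so the averaging rule, the five occupancy buckets, and every name here are assumptions. Persisting the history to files across Jobs is omitted.

```java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;

// Hypothetical history-based device chooser; not HadoopCL's actual learning model.
final class DeviceSchedulerSketch {
    enum Device { GPU, CPU, JVM }

    // Running totals of observed time for one (kernel, occupancy bucket, device).
    private static final class Stats {
        double totalMillis;
        int samples;
        double mean() { return totalMillis / samples; }
    }

    private final Map<String, Map<Device, Stats>> history = new HashMap<>();

    // Coarse occupancy bucket (0 = idle .. 4 = saturated) so similar load levels share data.
    private static int bucket(double occupancy) {
        return (int) Math.min(4, Math.max(0, Math.floor(occupancy * 5)));
    }

    private static String key(String kernel, double occupancy) {
        return kernel + "@" + bucket(occupancy);
    }

    // Record one finished task: which kernel ran, where, under what load, and how long it took.
    void record(String kernel, Device device, double avgOccupancy, double millis) {
        Stats s = history
            .computeIfAbsent(key(kernel, avgOccupancy), k -> new EnumMap<>(Device.class))
            .computeIfAbsent(device, d -> new Stats());
        s.totalMillis += millis;
        s.samples++;
    }

    // Pick the device with the lowest mean time seen at this load; otherwise use the hint.
    Device choose(String kernel, double currentOccupancy, Device hint) {
        Map<Device, Stats> seen = history.get(key(kernel, currentOccupancy));
        if (seen == null || seen.isEmpty()) {
            return hint;
        }
        Device best = hint;
        double bestMean = Double.POSITIVE_INFINITY;
        for (Map.Entry<Device, Stats> e : seen.entrySet()) {
            if (e.getValue().mean() < bestMean) {
                bestMean = e.getValue().mean();
                best = e.getKey();
            }
        }
        return best;
    }
}
```

Bucketing by occupancy reflects the slide's point that the same kernel may favor different devices depending on how busy they are.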
    • EVALUATION
        ▪ Mahout KMeans
            ‒ Mahout provides Hadoop MapReduce implementations of a variety of ML algorithms
            ‒ KMeans iteratively searches for K clusters
        ▪ HadoopCL KMeans port
            ‒ The mapper is trivial: for each point, it iterates through all clusters and outputs the closest
            ‒ The reducer is more complex
            ‒ Both OpenCL and Java versions were implemented, as HadoopCL allows the programmer to force JVM execution
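The mapper's "for each point, find the closest cluster" step, plus the basic centroid averaging that a KMeans iteration needs, can be sketched in plain Java. For brevity this uses dense vectors and squared Euclidean distance. The evaluated workloads operate on sparse text vectors, and the HadoopCL reducer is described as more complex than a plain mean. The method names are hypothetical.

```java
import java.util.List;

// Hypothetical dense-vector sketch of one KMeans iteration's map and reduce logic.
final class KMeansSketch {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    // Map step: for one point, scan all clusters and emit the id of the closest one.
    static int closestCluster(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int k = 0; k < centroids.length; k++) {
            double dist = squaredDistance(point, centroids[k]);
            if (dist < bestDist) {
                bestDist = dist;
                best = k;
            }
        }
        return best;
    }

    // Reduce step (simplified): average all points assigned to one cluster.
    // Assumes at least one point was assigned.
    static double[] newCentroid(List<double[]> assigned) {
        double[] mean = new double[assigned.get(0).length];
        for (double[] p : assigned) {
            for (int d = 0; d < mean.length; d++) {
                mean[d] += p[d];
            }
        }
        for (int d = 0; d < mean.length; d++) {
            mean[d] /= assigned.size();
        }
        return mean;
    }
}
```

Each point is processed independently in `closestCluster`, which is why the mapper maps naturally onto one OpenCL work-item per input point.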
    • EVALUATION
        ▪ Evaluated on a 10-node AMD APU cluster
        ▪ Two datasets tested with varying parameters
            ‒ Wiki data set
            ‒ ASF e-mail archives data set
            ‒ Varied K, the number of clusters
            ‒ Varied the type of pruning done on the input data (prune all but the N most frequent tokens vs. prune each vector to be at most length M)
            ‒ Varied the amount of pruning done (i.e. the values of N and M)
            ‒ Enabled and disabled HadoopCL features to observe their impact on performance
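The two pruning strategies can be sketched as follows, with each document represented as a map from token id to weight. The slide does not define "most frequent" or say which M entries survive per-vector pruning. This sketch therefore counts frequency as the number of documents containing a token, and keeps the M largest weights. Both choices and the method names are assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketches of the two input-pruning strategies.
// Each document is a sparse vector represented as token id -> weight.
final class PruningSketch {

    // Strategy 1: keep only the N tokens that occur in the most documents, dataset-wide.
    static List<Map<Integer, Double>> pruneToTopTokens(List<Map<Integer, Double>> docs, int n) {
        Map<Integer, Integer> docFreq = new HashMap<>();
        for (Map<Integer, Double> doc : docs) {
            for (int token : doc.keySet()) {
                docFreq.merge(token, 1, Integer::sum);
            }
        }
        List<Integer> tokens = new ArrayList<>(docFreq.keySet());
        // Most frequent first; ties broken by smaller token id so the result is deterministic.
        tokens.sort((a, b) -> {
            int byFreq = Integer.compare(docFreq.get(b), docFreq.get(a));
            return byFreq != 0 ? byFreq : Integer.compare(a, b);
        });
        Set<Integer> keep = new HashSet<>(tokens.subList(0, Math.min(n, tokens.size())));
        List<Map<Integer, Double>> out = new ArrayList<>();
        for (Map<Integer, Double> doc : docs) {
            Map<Integer, Double> pruned = new TreeMap<>(doc);
            pruned.keySet().retainAll(keep); // drop every token outside the global top N
            out.add(pruned);
        }
        return out;
    }

    // Strategy 2: cap each vector independently at M entries (assumed: the M largest weights).
    static Map<Integer, Double> pruneToLength(Map<Integer, Double> doc, int m) {
        List<Map.Entry<Integer, Double>> entries = new ArrayList<>(doc.entrySet());
        entries.sort((a, b) -> {
            int byWeight = Double.compare(b.getValue(), a.getValue());
            return byWeight != 0 ? byWeight : Integer.compare(a.getKey(), b.getKey());
        });
        Map<Integer, Double> pruned = new TreeMap<>();
        for (Map.Entry<Integer, Double> e : entries.subList(0, Math.min(m, entries.size()))) {
            pruned.put(e.getKey(), e.getValue());
        }
        return pruned;
    }
}
```

The first strategy shrinks the shared vocabulary and can leave some documents almost empty. The second bounds every vector's length, which also bounds the inner search loop in the sparse dot product.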
    • EVALUATION
        ▪ Graphs here
    • CONCLUSION
        ▪ HadoopCL offers the flexibility, reliability, and programmability of Hadoop accelerated by native, heterogeneous OpenCL threads
        ▪ Using HadoopCL is a tradeoff: lose parts of the Java language but gain improved performance
        ▪ Evaluation of KMeans with real-world data sets shows that HadoopCL is flexible and efficient enough to improve performance of real-world applications
        ▪ Future work to target HSA instead of OpenCL
        Max Grossman, jmg3@rice.edu
    • DISCLAIMER & ATTRIBUTION
      The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

      The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

      AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

      AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

      ATTRIBUTION
      © 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.