CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Java applications, by Gary Frost and Vignesh Ravi

Presentation CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Java applications, by Gary Frost and Vignesh Ravi at the AMD Developer Summit (APU13) Nov. 11-13, 2013.

Published in: Technology, Education

  1. HSA ENABLEMENT OF APARAPI: EASING THE DEVELOPER PATH TO APU/GPU ACCELERATED JAVA APPLICATIONS
     VIGNESH RAVI – SOFTWARE DEVELOPER, HSA TEAM, AMD
     GARY FROST – SOFTWARE FELLOW, AMD
  2. HSA ENABLEMENT OF APARAPI: AGENDA
     • Java GPU enablement via Aparapi
       ‒ Why Java?
       ‒ Aparapi
         ‒ What is it and how is it used?
     • Introduction to HSA
     • How HSA simplifies Java GPU programming with Aparapi
       ‒ Simpler programming model using lambda expressions
       ‒ Removal of previous constraints thanks to SVM (Shared Virtual Memory)
     • The nuts and bolts of our current HSA enablement
       ‒ HSAIL generation
       ‒ Dispatch via HSA Runtime APIs
     • Summary
     • Q&A
     2 | HSA ENABLEMENT OF APARAPI | NOVEMBER 2013
  3. WHY JAVA?
     • Java by the numbers
       ‒ 9 million developers
       ‒ 1 billion Java downloads per year
       ‒ 97% of enterprise desktops run Java
       ‒ 100% of Blu-ray players ship with Java
       Source: http://oracle.com.edgesuite.net/timeline/java/
     • Java 7 language & libraries already include concurrency features
       ‒ primitives (threads, locks, monitors, atomic ops)
       ‒ libraries (fork/join, thread pools, executors, futures)
     • Upcoming Java 8 includes stream processing enhancements
       ‒ support for 'lambda' expressions
       ‒ lambda-centric concurrent stream processing libs/APIs (java.util.stream.*)
  4. INITIAL APARAPI PROJECT OVERVIEW (2011)
     • Open Source framework
     • Allows Java developers access to GPU compute
     • Aparapi Java API for expressing data parallel workloads
          Kernel kernel = new Kernel(){
             @Override public void run(){
                int i=getGlobalId();
                square[i]=in[i]*in[i];
             }
          };
          kernel.execute(size);
     • Aparapi runtime capable of converting bytecode to OpenCL™
       ‒ Execution on OpenCL™ 1.1+ capable devices (GPUs and APUs)
       Or…
       ‒ Execute via a thread pool if OpenCL™ is unavailable.
     [Diagram: Java Application → override Aparapi Kernel base class's run() method → Aparapi converts bytecode to OpenCL™ → OpenCL™ compiler & runtime → GPU ISA → GPU; JVM → CPU ISA → CPU]
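The thread-pool fallback mentioned above can be pictured with a small plain-Java sketch. It does not use Aparapi: `ThreadPoolFallbackSketch` and its `execute` method are hypothetical names, and the striding scheme is only one plausible way to hand global ids to workers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntConsumer;

// Hypothetical stand-in for the thread-pool fallback; not Aparapi's actual code.
class ThreadPoolFallbackSketch {
    // Calls body.accept(id) once for every global id in [0, size).
    static void execute(int size, IntConsumer body) {
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final int first = t;
                // Worker t handles ids t, t+threads, t+2*threads, ...
                pending.add(pool.submit(() -> {
                    for (int id = first; id < size; id += threads) {
                        body.accept(id);
                    }
                }));
            }
            for (Future<?> f : pending) {
                f.get(); // wait for the worker and make its writes visible
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] in = {1, 2, 3, 4, 5};
        int[] square = new int[in.length];
        execute(in.length, i -> square[i] = in[i] * in[i]);
        System.out.println(Arrays.toString(square)); // prints [1, 4, 9, 16, 25]
    }
}
```

Each id is visited exactly once and writes only its own output element, which is the property that makes the same kernel body safe to run on either a GPU or a CPU thread pool.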
  5. MEET HSA AND HSAIL
     • Heterogeneous System Architecture standardizes CPU/GPU functionality
       ‒ Be ISA-agnostic for both CPUs and accelerators
       ‒ Support high-level programming languages
       ‒ Provide the ability to access pageable system memory from the GPU
       ‒ Maintain cache coherency for system memory between CPU and GPU
     • Specifications and simulator from HSA Foundation
       ‒ HSAIL portable ISA is "finalized" to particular hardware ISA at runtime
       ‒ Runtime specification for job launch and control
       ‒ HSAIL™ simulator for development and testing before hardware availability
  6. APARAPI HSA ENABLEMENT (2013-2014)
     • Open Source project sponsored
     • Enhanced to support HSA and Java 8 lambda expressions
          Device.hsa().forEach(size, i -> square[i]=in[i]*in[i] );
     • Allow developers to efficiently represent data parallel algorithms using new Java 8 lambda expressions
     • APIs have same look & feel as proposed Java 8 stream API features
     • No modifications to the JVM.
       ‒ We provide external JNI/Java libraries.
     [Diagram: Java Application → Aparapi lambda based API → Aparapi converts bytecode to HSAIL → HSAIL → HSA Finalizer & Runtime → GPU ISA → GPU; JVM → CPU ISA → CPU]
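To illustrate the "same look & feel as the stream API" point, here is a minimal plain-Java sketch that only mimics the call shape of `Device.hsa().forEach(...)`. `SketchDevice` is a hypothetical class, not part of Aparapi, and it runs everything on the CPU through Java 8 parallel streams rather than through HSAIL.

```java
import java.util.Arrays;
import java.util.function.Consumer;
import java.util.function.IntConsumer;
import java.util.stream.IntStream;

// Hypothetical class; it only mimics the call shape of Device.hsa().forEach(...).
class SketchDevice {
    static SketchDevice hsa() {
        return new SketchDevice();
    }

    // Index-range form: invoke body for every i in [0, size), possibly in parallel.
    void forEach(int size, IntConsumer body) {
        IntStream.range(0, size).parallel().forEach(body);
    }

    // Object-array form: invoke body once per element, possibly in parallel.
    <T> void forEach(T[] items, Consumer<T> body) {
        Arrays.stream(items).parallel().forEach(body);
    }

    public static void main(String[] args) {
        int[] in = {1, 2, 3, 4};
        int[] square = new int[in.length];
        SketchDevice.hsa().forEach(in.length, i -> square[i] = in[i] * in[i]);
        System.out.println(Arrays.toString(square)); // prints [1, 4, 9, 16]
    }
}
```

The caller's lambda is identical in both worlds; only the object that receives it decides where the work runs.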
  7. HSA AND LAMBDA ENABLED APARAPI EXECUTION EXAMPLE
          Device.hsa().forEach(size, i -> square[i]=in[i]*in[i] );
     [Flowchart. Decisions, each with Y/N branches: "Does platform support HSA?", "Is this the first execution of this lambda instance?", "Can bytecode be converted to HSAIL?", "Do we have HSAIL for this lambda?". Actions: "Convert bytecode to HSAIL", "Execute HSAIL kernel on GPU/APU", "Execute kernel using Java thread pool".]
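One plausible reading of the flowchart, written out as a plain-Java decision function. The branch wiring is an assumption, since the arrows did not survive in the transcript; the class, enum, and parameter names are all hypothetical.

```java
// Hypothetical sketch of the dispatch decision; the edges are an assumed reading of the flowchart.
class DispatchDecisionSketch {
    enum Action { CONVERT_THEN_RUN_ON_GPU, RUN_CACHED_HSAIL_ON_GPU, RUN_ON_THREAD_POOL }

    static Action decide(boolean platformSupportsHsa,
                         boolean firstExecution,
                         boolean convertibleToHsail,
                         boolean hsailCached) {
        if (!platformSupportsHsa) {
            return Action.RUN_ON_THREAD_POOL;           // no HSA: stay on the CPU
        }
        if (firstExecution) {
            // First time this lambda is seen: try to generate HSAIL from its bytecode.
            return convertibleToHsail ? Action.CONVERT_THEN_RUN_ON_GPU
                                      : Action.RUN_ON_THREAD_POOL;
        }
        // Later executions: reuse earlier HSAIL if the conversion succeeded before.
        return hsailCached ? Action.RUN_CACHED_HSAIL_ON_GPU
                           : Action.RUN_ON_THREAD_POOL;
    }

    public static void main(String[] args) {
        System.out.println(decide(true, true, true, false));
        System.out.println(decide(true, false, false, true));
        System.out.println(decide(false, true, true, false));
    }
}
```

Whatever the exact wiring, the thread pool is the common fallback, so a lambda that cannot be converted still runs correctly.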
  8. SUMATRA PROJECT: NATIVE SUPPORT FOR GPU OFFLOAD ADDED TO JAVA
     • AMD/Oracle sponsored Open Source (OpenJDK) project
     • Targeted at OpenJDK Java 9 (2015)
     • Allow developers to efficiently represent data parallel algorithms in Java using Stream API + Lambda expressions
     • Sumatra is not pushing a new 'programming model'
     • Instead we 'repurpose' Stream API + Lambda to enable both CPU and GPU computing
     • A Sumatra enabled Java Virtual Machine™ will dispatch 'selected' constructs to HSA enabled devices at runtime.
     • Developers are already refactoring the JDK to use stream & lambda APIs
       ‒ So anyone using the existing JDK should see GPU acceleration without any code changes.
     • Links:
       ‒ http://openjdk.java.net/projects/sumatra
       ‒ https://wikis.oracle.com/display/HotSpotInternals/Sumatra
       ‒ http://mail.openjdk.java.net/pipermail/sumatra-dev
     [Diagram: Java Application → Java JDK Stream + Lambda API → Java GRAAL JIT backend → HSAIL → HSA Finalizer & Runtime → GPU ISA → GPU; JVM → CPU ISA → CPU]
  9. HSA ENABLEMENT OF JAVA
     Java 7 – OpenCL enabled Aparapi
     • AMD initiated Open Source project
     • APIs for data parallel algorithms
       ‒ GPU accelerate Java applications
       ‒ No need to learn OpenCL
     • Active community captured mindshare
       ‒ ~20 contributors
       ‒ >7000 downloads
       ‒ ~150 visits per day
     [Stack: Java Application → APARAPI API → OpenCL™ → OpenCL™ Compiler and Runtime → GPU ISA → GPU; JVM → CPU ISA → CPU]
     Java 8 – HSA enabled Aparapi
     • Java 8 brings Stream + Lambda API.
       ‒ More natural way of expressing data parallel algorithms
       ‒ Initially targeted at multi-core.
     • APARAPI will:
       ‒ Support Java 8 Lambdas
       ‒ Dispatch code to HSA enabled devices at runtime via HSAIL
     [Stack: Java Application → APARAPI + Lambda API → HSAIL™ → HSA Finalizer & Runtime → GPU ISA → GPU; JVM → CPU ISA → CPU]
     Java 9 – HSA enabled Java (Sumatra)
     • Adds native GPU compute support to Java Virtual Machine (JVM)
     • Developer uses JDK provided Lambda + Stream API
     • JVM uses GRAAL compiler to generate HSAIL
     • JVM decides at runtime to execute on either CPU or GPU depending on workload characteristics.
     [Stack: Java Application → Java JDK Stream + Lambda API → Java GRAAL JIT backend → HSAIL™ → HSA™ Finalizer & Runtime → GPU ISA → GPU; JVM → CPU ISA → CPU]
     We plan to provide HSA Enabled Aparapi (Java 8) as a bridge technology between OpenCL based Aparapi (Java 7) and HSA Enabled Sumatra (Java 9)
  10. A CASE STUDY CENTERED ON NBODY
     • A Java developer implementing a sequential version of NBody would probably…
       ‒ Create a class to represent each body
          class Body{
             float x,y,z,m,vx,vy,vz;
             // Include method to update position and display
             void updateAndShow(Screen screen, Body[] bodies){
                for (Body other:bodies){
                   // accumulate forces between other and this
                }
                // update vx,vy,vz,x,y and z from accumulated data
                screen.paint(x,y,z);
             }
          }
     • Loop through each Body (in array of bodies[]) to update and display
          for (Body b: bodies)
             b.updateAndShow(screen, bodies);
  11. WITHOUT HSA WE CAN'T (EFFICIENTLY) USE OBJECTS
     • In Java, allocated objects are scattered on the heap.
       ‒ There is no way to allocate an array of objects in contiguous memory (as with C++)
       ‒ We force the developer to resort to using parallel arrays of primitives (which are contiguous)
          float x[], y[], z[], m[], vx[], vy[], vz[];
       ‒ And to infer that x[n], y[n] and z[n] hold the state for bodies[n].
          Kernel kernel = new Kernel(){
             public void run(){
                int i = getGlobalId(0);
                for (int j=0; j<bodies.length; j++){
                   // accum forces between (x,y,z)[j] and (x,y,z)[i]
                }
                // update vx[i],vy[i],vz[i],x[i],y[i] and z[i]
             }
          };
       ‒ Then the kernel can be used to execute the code on the GPU
          kernel.execute(bodies.length);
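The cost of the parallel-array workaround is easier to see with the copy step written out. This is a minimal plain-Java sketch; `ParallelArraysSketch`, its trimmed-down `Body`, and `pack` are hypothetical names used only for illustration.

```java
// Hypothetical sketch: flattening Body objects into contiguous primitive arrays.
class ParallelArraysSketch {
    static class Body {
        final float x, y, z, m;
        Body(float x, float y, float z, float m) {
            this.x = x; this.y = y; this.z = z; this.m = m;
        }
    }

    // Returns {x[], y[], z[], m[]}; index n in each array describes bodies[n].
    static float[][] pack(Body[] bodies) {
        int n = bodies.length;
        float[] x = new float[n], y = new float[n], z = new float[n], m = new float[n];
        for (int i = 0; i < n; i++) {
            x[i] = bodies[i].x;
            y[i] = bodies[i].y;
            z[i] = bodies[i].z;
            m[i] = bodies[i].m;
        }
        return new float[][] { x, y, z, m };
    }

    public static void main(String[] args) {
        Body[] bodies = { new Body(1f, 2f, 3f, 10f), new Body(4f, 5f, 6f, 20f) };
        float[][] soa = pack(bodies);
        System.out.println(soa[0][1] + " " + soa[3][1]); // x and mass of bodies[1]
    }
}
```

Besides the extra copy in each direction, the link between `x[n]` and `bodies[n]` now lives only in the programmer's head, which is the maintenance burden the slide is pointing at.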
  12. HSA ENABLED APARAPI (AND SUMATRA) ALLOWS USE OF OBJECTS
     • So we code our Body class exactly as we would if executing in Java.
          class Body{
             float x,y,z,m,vx,vy,vz;
             // Include method to update position and display
             void updateAndShow(Screen screen, Body[] bodies){
                for (Body other:bodies){
                   // accumulate forces between other and this
                }
                // update vx,vy,vz,x,y and z from accumulated data
                screen.paint(x,y,z);
             }
          }
     • Then use the new Aparapi lambda enabled API to coordinate dispatch to the GPU
          Device.hsa().forEach(bodies, b -> {
             b.updateAndShow(screen, bodies);
          });
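A runnable CPU-only sketch of the object-based style follows, using Java 8 parallel streams in place of `Device.hsa()`. The class and method names are hypothetical, the gravitational constant is taken as 1, and the update is split into two passes (an assumption of this sketch, not something the slides specify) so that no body reads a position another thread is writing.

```java
import java.util.Arrays;

// Hypothetical sketch of an object-based NBody step on the CPU.
class NBodyObjectsSketch {
    static class Body {
        float x, y, z, m, vx, vy, vz;
        Body(float x, float y, float z, float m) {
            this.x = x; this.y = y; this.z = z; this.m = m;
        }

        // Pass 1: read every body's position, write only this body's velocity.
        void accumulate(Body[] bodies, float dt) {
            float ax = 0, ay = 0, az = 0;
            for (Body other : bodies) {
                if (other == this) continue;
                float dx = other.x - x, dy = other.y - y, dz = other.z - z;
                float distSq = dx * dx + dy * dy + dz * dz + 1e-6f; // softening term
                float inv = (float) (other.m / (distSq * Math.sqrt(distSq)));
                ax += dx * inv; ay += dy * inv; az += dz * inv;
            }
            vx += ax * dt; vy += ay * dt; vz += az * dt;
        }

        // Pass 2: move this body using its updated velocity.
        void move(float dt) {
            x += vx * dt; y += vy * dt; z += vz * dt;
        }
    }

    static void step(Body[] bodies, float dt) {
        Arrays.stream(bodies).parallel().forEach(b -> b.accumulate(bodies, dt));
        Arrays.stream(bodies).parallel().forEach(b -> b.move(dt));
    }

    public static void main(String[] args) {
        Body[] bodies = { new Body(-1f, 0f, 0f, 1f), new Body(1f, 0f, 0f, 1f) };
        step(bodies, 0.1f);
        System.out.println(bodies[0].x + " " + bodies[1].x);
    }
}
```

The lambda passed to `forEach` touches ordinary heap objects directly, which is the access pattern the slide attributes to HSA's shared virtual memory.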
  13. OVERVIEW OF HSA ENABLED APARAPI
     • Development time: MyLambda.java → javac (compiler) → MyLambda.class
     • HSA enabled Aparapi, at runtime:
       ‒ Step 0: Generate HSAIL from bytecode
       ‒ Step 1: Generate host HSA Runtime calls
         ‒ Step 1.1: Initialize HSA runtime, device, queue …
         ‒ Step 1.2: Finalize HSAIL to generate GPU ISA
         ‒ Step 1.3: Bind Java args to HSA args
         ‒ Step 1.4: Dispatch the kernel
         ‒ Step 1.5: Wait for completion
       ‒ Repeat steps 1.3 - 1.5 for next iteration of same kernel
       ‒ Repeat steps 0 - 1 for each new kernel
     [Diagram: Application → Aparapi (Generate HSAIL → Generate HSA RT calls: Initialize, Finalize, Bind Args, Dispatch) → GPU ISA → GPU; JVM contains CPU ISA → CPU]
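The "repeat steps 1.3 - 1.5 for the same kernel, repeat steps 0 - 1 for a new kernel" rule amounts to caching the expensive preparation per kernel. A minimal plain-Java sketch, assuming the kernel is identified by its class; `KernelCacheSketch`, `prepare`, and the placeholder string are hypothetical and no real HSAIL or ISA is produced.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: prepare once per kernel, dispatch many times.
class KernelCacheSketch {
    private final Map<Class<?>, String> finalized = new ConcurrentHashMap<>();
    final AtomicInteger prepareCount = new AtomicInteger();

    // Stand-in for steps 0 to 1.2 (generate HSAIL, initialize, finalize).
    private String prepare(Class<?> kernelClass) {
        prepareCount.incrementAndGet();
        return "isa-for-" + kernelClass.getName(); // placeholder, not real ISA
    }

    // Stand-in for steps 1.3 to 1.5 (bind args, dispatch, wait); runs on every call.
    String dispatch(Runnable kernel) {
        String isa = finalized.computeIfAbsent(kernel.getClass(), this::prepare);
        kernel.run();
        return isa;
    }

    public static void main(String[] args) {
        KernelCacheSketch cache = new KernelCacheSketch();
        Runnable kernel = () -> { };
        cache.dispatch(kernel);
        cache.dispatch(kernel);
        System.out.println(cache.prepareCount.get()); // prints 1
    }
}
```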
  14. HIGH LEVEL HSA FEATURES
     • Features currently being defined in the HSA Working Groups**
       ‒ Unified addressing across all processors
       ‒ Operation into pageable system memory
       ‒ Full memory coherency
       ‒ Platform atomics
       ‒ User mode dispatch
         ‒ Enables fast dispatch with no driver involvement
       ‒ Architected queuing language
         ‒ Flexible compute dispatch, easier GPU self-enqueue
       ‒ High level language support for GPU compute processors
       ‒ Preemption and context switching
     ** All features subject to change, pending completion and ratification of specifications in the HSA Working Groups
     © Copyright 2012 HSA Foundation. All Rights Reserved.
  15. HSA INTERMEDIATE LANGUAGE (HSAIL)**
     • HSAIL is a virtual ISA for parallel programs
       ‒ Finalized to vendor-specific ISA by a JIT compiler or "Finalizer"
       ‒ ISA independent by design for CPU & GPU
     • Explicitly parallel
       ‒ Designed for data parallel programming
     • Support for exceptions, virtual functions, and other high level language features
     • Lower level than OpenCL™ SPIR
       ‒ Fits naturally in the OpenCL™ compilation stack
     • Suitable to support additional high level languages and programming models:
       ‒ Java, C++, OpenMP, etc.
     ** Subject to change, pending completion and ratification of specifications in the HSA Working Groups
  16. HSAIL OVERVIEW**
     INSTRUCTION SET
     • Similar to assembly language for a RISC CPU
       ‒ Load-store architecture
          ld_global_u64 $d0, [$d6 + 120] ;   $d0 = load($d6+120)
          add_u64 $d1, $d2, 24 ;             $d1 = $d2+24
     • 136 opcodes (Java™ bytecode has 200)
       ‒ Floating point (single, double, half (f16))
       ‒ Integer (32-bit, 64-bit)
       ‒ Some packed operations
       ‒ Branches
       ‒ Function calls
       ‒ Platform atomic operations: and, or, xor, exch, add, sub, inc, dec, max, min, cas
         ‒ Synchronize host CPU and HSA Component!
     • Text and binary formats ("BRIG")
     REGISTERS
     • Four classes of registers
       ‒ C: 1-bit, control registers
       ‒ S: 32-bit, single-precision FP or int
       ‒ D: 64-bit, double-precision FP or long int
       ‒ Q: 128-bit, packed data
     • Fixed number of registers:
       ‒ 8 C
       ‒ S, D, Q share a single pool of resources
          S + 2*D + 4*Q <= 128
          Up to 128 S or 64 D or 32 Q (or a blend)
     • Register allocation done in high-level compiler
       ‒ Finalizer doesn't have to perform expensive register allocation
     ** Subject to change, pending completion and ratification of specifications in the HSA Working Groups
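The shared register pool rule `S + 2*D + 4*Q <= 128` is simple enough to check mechanically. A small sketch, with the hypothetical helper `fits` standing in for whatever check a high-level compiler would apply:

```java
// Hypothetical helper: does a register mix fit the shared S/D/Q pool quoted on the slide?
class RegisterBudgetSketch {
    static boolean fits(int s, int d, int q) {
        if (s < 0 || d < 0 || q < 0) {
            throw new IllegalArgumentException("register counts must be non-negative");
        }
        // An S register costs 1 unit, a D register 2 units, a Q register 4 units.
        return s + 2 * d + 4 * q <= 128;
    }

    public static void main(String[] args) {
        System.out.println(fits(128, 0, 0)); // prints true
        System.out.println(fits(64, 16, 8)); // prints true  (64 + 32 + 32 = 128)
        System.out.println(fits(64, 32, 1)); // prints false (64 + 64 + 4 = 132)
    }
}
```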
  17. SEGMENTS AND MEMORY**
     • 7 segments of memory
       ‒ global, readonly, group, spill, private, arg, kernarg
       ‒ Memory instructions can (optionally) specify a segment
     • Global Segment
       ‒ Visible to all HSA agents (including host CPU)
     • Group Segment
       ‒ Provides high-performance memory shared in the work-group.
       ‒ Group memory can be read and written by any work-item in the work-group
       ‒ HSAIL provides sync operations to control visibility of group memory
     • Spill, Private, Arg Segments
       ‒ Represent different regions of a per-work-item stack
       ‒ Typically generated by compiler, not specified by programmer
     • Kernarg Segment
       ‒ Programmer writes kernarg segment to pass arguments to a kernel
     • Read-Only Segment
       ‒ Remains constant during execution of kernel
     • Flat Addressing
       ‒ Each segment mapped into virtual address space
       ‒ Flat addresses can map to segments based on virtual address
       ‒ Instructions with no explicit segment use flat addressing
       ‒ Very useful for high-level language support (i.e. classes, libraries)
       ‒ Aligns well with OpenCL 2.0 "generic" addressing feature
     Examples:
          ld_global_u64 $d0, [$d6]
          ld_group_u64 $d0,[$d6+24]
          st_spill_f32 $s1,[$d6+4]
          ld_kernarg_u64 $d6,[%_arg0]
          ld_u64 $d0,[$d6+24] ; flat
     ** Subject to change, pending completion and ratification of specifications in the HSA Working Groups
  18. EXAMPLE – BYTECODE TO HSAIL GENERATION
     Java source:
          int in[], out[];
          Device.hsa().forEach(len, i -> out[i] = in[i] * in[i] );
     Bytecode (javac -g squares.java):
          0: aload_0    //out[]
          1: iload_2    //i
          2: aload_1    //in[]
          3: iload_2
          4: iaload
          5: aload_1
          6: iload_2
          7: iaload
          8: imul
          9: iastore
          10: return
     Generated HSAIL:
          version 0:95: $full : $large;
          kernel &run(
             kernarg_u64 %_arg0, //out[]
             kernarg_u64 %_arg1, //in[]
             kernarg_s32 %_arg2
          ){
             ld_kernarg_u64 $d0, [%_arg0];
             ld_kernarg_u64 $d1, [%_arg1];
             ld_kernarg_s32 $s2, [%_arg2];
             workitemabsid_u32 $s2, 0; //i
             mov_b64 $d3, $d0;
             mov_b32 $s4, $s2;
             mov_b64 $d5, $d1;
             mov_b32 $s6, $s2;
             cvt_u64_s32 $d6, $s6;
             mad_u64 $d6, $d6, 4, $d5;
             ld_global_s32 $s5, [$d6+24];
             mov_b64 $d6, $d1;
             mov_b32 $s7, $s2;
             cvt_u64_s32 $d7, $s7;
             mad_u64 $d7, $d7, 4, $d6;
             ld_global_s32 $s6, [$d7+24];
             mul_s32 $s5, $s5, $s6;
             cvt_u64_s32 $d4, $s4;
             mad_u64 $d4, $d4, 4, $d3;
             st_global_s32 $s5, [$d4+24];
             ret;
          };
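Each array access in the HSAIL listing is a `cvt_u64_s32` / `mad_u64` pair followed by a load or store at `+24`. The arithmetic can be written out in Java as below; the interpretation of 24 as the offset of element 0 past the array object's header is an assumption drawn from this listing only, and `elementAddress` is a hypothetical helper.

```java
// Hypothetical helper mirroring the address arithmetic in the listing:
//   cvt_u64_s32 $d6, $s6;        widen the index to 64 bits
//   mad_u64 $d6, $d6, 4, $d5;    index * 4 + array base
//   ld_global_s32 $s5, [$d6+24]; element 0 sits 24 bytes past the base (assumed header size)
class ArrayAddressSketch {
    static final long ELEMENT_OFFSET = 24; // taken from the listing, not a general JVM constant
    static final long INT_SIZE = 4;

    static long elementAddress(long arrayBase, int index) {
        return (long) index * INT_SIZE + arrayBase + ELEMENT_OFFSET;
    }

    public static void main(String[] args) {
        System.out.println(elementAddress(1000L, 0)); // prints 1024
        System.out.println(elementAddress(1000L, 3)); // prints 1036
    }
}
```

Because the GPU dereferences the same virtual address the JVM uses, no copy of `in[]` or `out[]` into a separate device buffer appears anywhere in the listing.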
  19. APARAPI JNI CALL -> HSA RUNTIME API
     Device Discovery & Queue Creation APIs**
     • Discover HSA Device
       ‒ Both count and device_list are out params
       ‒ User can iterate over HSA devices in the list
          HsaStatus HsaGetDevices(unsigned int *count,
                                  const HsaDevice **device_list);
     • User-Mode Queue Creation
       ‒ User can provide pre-allocated buffer
       ‒ If not, API will allocate a buffer
       ‒ queue is the user-mode queue
          HsaStatus HsaCreateUserModeQueue(const HsaDevice *device,
                                           void *buffer, size_t buffer_size,
                                           HsaQueuePriority queue_priority,
                                           HsaQueueFraction queue_fraction,
                                           HsaQueue **queue);
     ** All APIs subject to change, pending completion and ratification of specifications in the HSA Working Groups
  20. APARAPI JNI -> HSA RUNTIME API
     Finalize HSAIL to GPU ISA**
     • Translating HSAIL text to binary (BRIG)
       ‒ BRIG is a binary container for several sections
         ‒ Code
         ‒ String
         ‒ Directive
         ‒ …
       ‒ libHsail is an assembler/disassembler
         ‒ This is a standalone compiler library
         ‒ Not part of Runtime
          Status Assemble (const char* hsail_text, HsaBrig *brig);
     • Finalize BRIG to IHV specific GPU ISA
       ‒ Input: BRIG
       ‒ Output: HsaKernelCode which contains ISA
          HsaStatus HsaFinalizeBrig(const HsaDevice *device,
                                    HsaBrig *brig,
                                    const char *kernel_name,
                                    const char *options,
                                    HsaKernelCode **kernel);
     ** All APIs subject to change, pending completion and ratification of specifications in the HSA Working Groups
  21. APARAPI JNI -> POPULATION OF AQL DISPATCH PACKET
     • AQL Dispatch Packet**
       ‒ Header enables:
         ‒ Different packet types
         ‒ Specify if this packet should wait for all previous packets to complete
         ‒ Control visibility of data and memory fences before and after dispatch
       ‒ Body enables:
         ‒ Specify the problem fan out using launch config related fields
         ‒ How much workgroup memory?
         ‒ Location of IHV specific GPU ISA
         ‒ Location of where kernel args can be found
         ‒ A signal mechanism to wait on kernel completion
     • Only the kernel info and the signal are opaque, so populating them requires runtime APIs
       ‒ Other fields are open, so simple assignments suffice
          typedef struct HsaAqlDispatchPacket {
             /* Header fields */
             uint32_t format : 8;
             uint32_t barrier : 1;
             uint32_t acquire_fence_scope : 2;
             uint32_t release_fence_scope : 2;
             uint32_t invalidate_instruction_cache : 1;
             uint32_t invalidate_roi_image_cache : 1;
             uint32_t dimensions : 2;
             uint32_t reserved : 15;
             /* Launch config */
             uint16_t workgroup_size[3];
             uint16_t reserved2;
             uint32_t grid_size[3];
             /* Kernel info */
             uint32_t private_segment_size_bytes;
             uint32_t group_segment_size_bytes;
             uint64_t kernel_object_address;
             uint64_t kernel_arg_address;
             uint64_t reserved3;
             /* Kernel synchronization */
             uint64_t completion_signal;
          } HsaAqlDispatchPacket;
     ** Subject to change, pending completion and ratification of specifications in the HSA Working Groups
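The eight header bitfields add up to exactly 32 bits (8+1+2+2+1+1+2+15), so the header is one word. The sketch below packs them in Java, assuming fields are laid out from the least significant bit upward in declaration order; C leaves bitfield layout to the compiler, so that ordering is an assumption, and `AqlHeaderSketch` is a hypothetical name.

```java
// Hypothetical sketch: packing the header bitfields into one 32-bit word,
// assuming least-significant-bit-first layout in declaration order.
class AqlHeaderSketch {
    // Places value into a field of the given width at the given bit offset.
    static int field(int value, int bits, int shift) {
        if (value < 0 || value >= (1 << bits)) {
            throw new IllegalArgumentException("value does not fit in " + bits + " bits");
        }
        return value << shift;
    }

    static int pack(int format, int barrier, int acquireScope, int releaseScope,
                    int invalidateICache, int invalidateRoiCache, int dimensions) {
        return field(format, 8, 0)
             | field(barrier, 1, 8)
             | field(acquireScope, 2, 9)
             | field(releaseScope, 2, 11)
             | field(invalidateICache, 1, 13)
             | field(invalidateRoiCache, 1, 14)
             | field(dimensions, 2, 15); // bits 17..31 stay zero (reserved)
    }

    public static void main(String[] args) {
        // 2 + 256 + 1024 + 4096 + 32768
        System.out.println(pack(2, 1, 2, 2, 0, 0, 1)); // prints 38146
    }
}
```

These are the "open" fields the slide mentions: plain assignments with no runtime call involved.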
  22. POPULATING KERNEL INFO AND SIGNAL USING HSA RT API**
          HsaStatus HsaFinalizeBrig(const HsaDevice *device,
                                    HsaBrig *brig,
                                    const char *kernel_name,
                                    const char *options,
                                    HsaKernelCode **kernel);
          typedef struct HsaKernelCode {
             …
             uint32_t workitem_private_segment_byte_size;
             uint32_t workgroup_group_segment_byte_size;
             uint64_t kernarg_segment_byte_size;
             …
          } HsaKernelCode;
          typedef struct HsaAqlDispatchPacket {
             …
             uint32_t private_segment_size_bytes;
             uint32_t group_segment_size_bytes;
             uint64_t kernel_object_address;
             uint64_t kernel_arg_address;
             …
             uint64_t completion_signal;
          }
     • Pack Java args into a vector in JNI
     • Register vector data address
          HsaStatus HsaRegisterSystemMemory(void *address, size_t size);
          HsaStatus HsaCreateSignal(HsaSignal *signal);
     ** Subject to change, pending completion and ratification of specifications in the HSA Working Groups
  23. DISPATCH AND WAIT ON KERNEL COMPLETION
     • Dispatch
       ‒ Submit AQL packet into the HsaQueue
       ‒ Thread safe API
          HsaStatus HsaSubmitAql(HsaQueue *queue, HsaAqlDispatchPacket *aql_packet);
     • Wait on kernel completion
          bool is_done = false;
          while (!is_done) {
             status = HsaQuerySignal(signal, &is_done);
             assert(status == kHsaStatusSuccess);
          }
     • After completion, disposing HSA resources
       ‒ Release queue
       ‒ Release signal
       ‒ Release kernel object
       ‒ Deregister kernel args related memory
          HsaStatus HsaDestroyUserModeQueue(HsaQueue *queue);
          HsaStatus HsaDestroySignal(HsaSignal signal);
          HsaStatus HsaFreeKernelCode(HsaKernelCode *kernel);
          HsaStatus HsaDeregisterSystemMemory(void *address);
     ** Subject to change, pending completion and ratification of specifications in the HSA Working Groups
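The dispatch-then-poll pattern has a close analogue in plain Java. The sketch below is not the HSA runtime: a worker thread stands in for the GPU, an `AtomicBoolean` stands in for the completion signal, and all names are hypothetical.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical analogue of "dispatch, then poll the completion signal".
class CompletionSignalSketch {
    static int runAndWait(int[] data) {
        AtomicBoolean done = new AtomicBoolean(false); // plays the role of the signal
        int[] result = new int[1];

        Thread worker = new Thread(() -> {             // plays the role of the device
            int sum = 0;
            for (int v : data) {
                sum += v;
            }
            result[0] = sum;
            done.set(true); // signal completion only after the result is written
        });
        worker.start();                                // "dispatch"

        while (!done.get()) {                          // poll, like the HsaQuerySignal loop
            Thread.yield();
        }
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println(runAndWait(new int[] {1, 2, 3})); // prints 6
    }
}
```

Setting the flag after the write and reading it before the result is what makes the worker's output visible to the waiting thread, the same ordering role the signal and fence scopes play for a kernel's writes.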
  24. DEMO
  25. SUMMARY
     • Aparapi is already an established framework for simplifying execution of Java on GPU devices
     • HSA enabled Aparapi further simplifies GPU acceleration of Java applications
       ‒ Aligns with Java 8 features to support 'lambda' expressions for compactness
       ‒ Enables 'large unified' system memory for GPU acceleration
       ‒ Eases programming by enabling direct access to Java objects on the heap
       ‒ Enables fast offload of Java kernels through user-mode queue and AQL
     • HSA enabled Aparapi lends itself to more interesting future possibilities
       ‒ Simplified communication and workload balancing across both CPU and GPU
       ‒ Exploit new computation patterns and recursions through kernel self-enqueue
  26. QUESTIONS & ANSWERS?
  27. DISCLAIMER & ATTRIBUTION
     The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
     The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
     AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
     AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
     ATTRIBUTION
     © 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. OpenCL is a trademark of Apple Inc. HSA is a trademark of the Heterogeneous System Architecture Foundation. Other names are for informational purposes only and may be trademarks of their respective owners.