
Accelerating Particle Image Velocimetry using Hybrid Architectures

High Performance Computing (HPC) applications are mapped to a cluster of multi-core processors communicating over high-speed interconnects. More computational power is harnessed with the addition of hardware accelerators such as Graphics Processing Unit (GPU) cards and Field Programmable Gate Arrays (FPGAs). Particle Image Velocimetry (PIV) is an embarrassingly parallel application that can benefit from acceleration using hybrid architectures. The PIV application is mapped to an Nvidia GPU system, resulting in a 3x speedup over a dual quad-core Intel processor implementation. The design methodology used to implement the PIV application on a specialized FPGA platform under development is described in brief, and the resulting performance benefit is analyzed.



  1. Accelerating Particle Image Velocimetry using Hybrid Architectures: Vivek Venugopal, Cameron Patterson, Kevin Shinpaugh
  2. Cardio-Vascular Disease (CVD): [Pie chart: cause of deaths in the United States for 2005 (reference: American Heart Association). Categories: cardiovascular disease (CVD), cancer, other, chronic lung respiratory disease, accidents, Alzheimer's disease. Slice values as listed in the transcript: 5%, 5%, 6%, 37%, 24%, 24%.]
  3. Research@AEThER Lab: [Figure labels: intravascular pressure distribution; location of stent within the heart; region of interest.] • Research focuses on cardiovascular fluid dynamics and stent hemodynamics, which are useful in designing intravascular stents for treating CVD.
  4. Research@AEThER Lab: [Figure: PIV setup in four panels: photo of the entire setup (top left), data acquisition and control (top right), motorized traverse (bottom left), DPIV camera (bottom right). Labels: DPIV camera; laser-illuminated fluid flow.] • The AEThER lab uses the Particle Image Velocimetry (PIV) technique to track the motion of particles in an illuminated flow field.
  5. Motivation: [Figure: the PIV experimental setup yields Case 1 to Case 100 across stent configurations 1 to 20; each case = 1250 image pairs x 5 MB = 6.25 GB.] • Time for FlowIQ analysis of one image pair on a 2 GHz Xeon processor = 16 minutes. • 1250 image pairs x 100 cases x 20 stent configurations = 2.6 years
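The figures on the Motivation slide can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: the variable names are invented here, and it assumes decimal units (1 GB = 1000 MB), 365-day years, and strictly serial processing at the quoted 16 minutes per image pair.

```python
# Figures quoted on the Motivation slide.
PAIRS_PER_CASE = 1250
MB_PER_PAIR = 5
CASES = 100
STENT_CONFIGS = 20
MINUTES_PER_PAIR = 16  # FlowIQ on a 2 GHz Xeon

# Data volume (assumption: 1 GB = 1000 MB, 1 TB = 1000 GB).
gb_per_case = PAIRS_PER_CASE * MB_PER_PAIR / 1000
total_pairs = PAIRS_PER_CASE * CASES * STENT_CONFIGS
total_tb = total_pairs * MB_PER_PAIR / 1_000_000

# Run time (assumption: pairs are processed strictly one after another).
serial_minutes = total_pairs * MINUTES_PER_PAIR
serial_years = serial_minutes / (60 * 24 * 365)

print(f"per case: {gb_per_case} GB, all cases: {total_tb} TB")
print(f"image pairs: {total_pairs}, serial estimate: {serial_years:.1f} years")
```

Under these assumptions the sketch reproduces the slide's 6.25 GB per case and gives 12.5 TB over all 2.5 million image pairs. Its strictly serial estimate (about 76 years) is larger than the 2.6 years quoted on the slide; the transcript does not say which processing assumptions sit behind the slide's total.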
  6. FFT-based PIV algorithm: [Figure: Image 1 (time t) and Image 2 (time t + dt) are divided into zones of size (16 x 16), (32 x 32), or (64 x 64); zone 1 and zone 2 each pass through an FFT, followed by multiplication, an IFFT, and a reduction block that outputs the motion vector.] • Peak detection using FFT blocks • Reduction block: sub-pixel algorithm and filtering to determine the velocity value
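The FFT, multiplication, IFFT, and reduction stages on this slide can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names are invented here, the three-point parabolic fit is one common sub-pixel method chosen as an assumption (the slide does not name its algorithm), filtering is omitted, and the test zones are synthetic.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out

def fft2(a, inverse=False):
    """2D FFT: transform every row, then every column."""
    rows = [fft(r, inverse) for r in a]
    cols = [fft(list(c), inverse) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

def correlate(zone1, zone2):
    """Circular cross-correlation: IFFT(conj(FFT(zone1)) * FFT(zone2))."""
    f1, f2 = fft2(zone1), fft2(zone2)
    prod = [[a.conjugate() * b for a, b in zip(r1, r2)]
            for r1, r2 in zip(f1, f2)]
    size = len(zone1) * len(zone1[0])
    return [[v.real / size for v in row] for row in fft2(prod, inverse=True)]

def refine(c_minus, c_zero, c_plus):
    """Sub-pixel offset of a peak from a three-point parabolic fit (assumed method)."""
    denom = c_minus - 2.0 * c_zero + c_plus
    return 0.0 if denom == 0 else 0.5 * (c_minus - c_plus) / denom

def displacement(zone1, zone2):
    """Return the (dy, dx) shift of zone2 relative to zone1."""
    corr = correlate(zone1, zone2)
    rows, cols = len(corr), len(corr[0])
    # Peak detection: location of the largest correlation value.
    py, px = max(((y, x) for y in range(rows) for x in range(cols)),
                 key=lambda p: corr[p[0]][p[1]])
    sub_y = refine(corr[(py - 1) % rows][px], corr[py][px], corr[(py + 1) % rows][px])
    sub_x = refine(corr[py][(px - 1) % cols], corr[py][px], corr[py][(px + 1) % cols])
    # Indices past the midpoint represent negative shifts (circular wrap).
    dy = py - rows if py > rows // 2 else py
    dx = px - cols if px > cols // 2 else px
    return dy + sub_y, dx + sub_x

# Synthetic 16 x 16 zone with three "particles", shifted by (+2, -1).
n = 16
zone1 = [[0.0] * n for _ in range(n)]
for y, x in [(1, 2), (4, 5), (6, 1)]:
    zone1[y][x] = 1.0
zone2 = [[zone1[(y - 2) % n][(x + 1) % n] for x in range(n)] for y in range(n)]
dy, dx = displacement(zone1, zone2)
print(round(dy, 3), round(dx, 3))
```

Dividing the recovered displacement by dt would give a velocity per zone; each zone pair is independent of the others, which is what makes the workload easy to parallelize.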
  7. Implementation platform: [Figure: excerpt from an NVIDIA Tesla GPU computing board product description.] • System G: 325 Apple Mac Pro nodes, 2 quad-core Intel Xeon processors with 8 GB RAM per node running Linux CentOS • Tesla C1060: 240 streaming cores clocked at 1.3 GHz with 4 GB onboard memory
  8. GPU platform: [Figure, left: CUDA hardware model for Tesla C1060, showing Multiprocessor 1 to Multiprocessor 30, each with shared memory, registers, Processor 1 to Processor 8, an instruction unit, a constant cache, and a texture cache, all above GPU memory. Figure, right: CUDA programming model, in which the host (CPU) launches kernel 1 and kernel 2 on the device (GPU); each kernel runs as a grid of blocks, Block (0,0) to Block (1,2), and each block contains threads, Thread (0,0) to Thread (2,3).] • Nvidia Tesla C1060: 30 streaming multiprocessors with 8 cores each (SIMD) • Kernel-based execution
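The grid, block, and thread hierarchy on this slide can be illustrated with a small Python emulation of the usual index calculation (block index times block size plus thread index). The grid and block shapes below are read off the figure labels, and `global_index` is a name invented for this sketch; real CUDA code would compute the same quantity from `blockIdx`, `blockDim`, and `threadIdx`.

```python
from itertools import product

SMS, CORES_PER_SM = 30, 8   # Tesla C1060 figures from the slide
GRID = (2, 3)               # blocks per grid (rows, cols), as drawn
BLOCK = (3, 4)              # threads per block (rows, cols), as drawn

def global_index(block, thread, block_dim=BLOCK):
    """Map a (block, thread) pair to a unique global 2D coordinate."""
    return (block[0] * block_dim[0] + thread[0],
            block[1] * block_dim[1] + thread[1])

threads = [
    global_index(b, t)
    for b in product(range(GRID[0]), range(GRID[1]))
    for t in product(range(BLOCK[0]), range(BLOCK[1]))
]

print("streaming cores:", SMS * CORES_PER_SM)
print("threads in one kernel launch:", len(threads))
print("Block (1,1), Thread (2,3) ->", global_index((1, 1), (2, 3)))
```

Every thread gets a distinct coordinate, so each one can be assigned its own piece of the data without coordination.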
  9. GPU platform: [Figure: on the CPU, the PIV sequential code in C consists of 15 functions looped 3096 times. In the CUDA-enabled program on the GPU, each function becomes a kernel executed over index 0 to index n (kernel 1_0, kernel 1_1, ..., kernel 1_n through kernel 15_0, kernel 15_1, ..., kernel 15_n), with a cudaThreadSynchronize() call after each kernel; n = loop size = 3096.]
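The restructuring on this slide, from a loop over indices on the CPU to one kernel per function on the GPU, can be sketched in Python. This is an illustration under stated assumptions: the fifteen stand-in functions are hypothetical, the sequential code is assumed to call all 15 functions inside one outer loop, and the work at each index is assumed to be independent of every other index.

```python
N = 3096  # loop size quoted on the slide

# Fifteen stand-in functions (hypothetical); the order of application matters.
functions = [lambda v, k=k: (3 * v + k) % 1_000_003 for k in range(15)]

def run_sequential(data):
    """CPU style: apply all 15 functions to one index, then move on."""
    out = list(data)
    for i in range(len(out)):
        for f in functions:
            out[i] = f(out[i])
    return out

def run_kernel_style(data):
    """GPU style: one 'kernel' per function, applied across all indices."""
    out = list(data)
    for f in functions:
        # Each index is independent, so a GPU could run these N
        # applications as N threads of a single kernel launch.
        out = [f(v) for v in out]
        # Completing the list plays the role of cudaThreadSynchronize():
        # the next kernel starts only after every index has finished.
    return out

data = list(range(N))
assert run_sequential(data) == run_kernel_style(data)
print("both orderings agree for", N, "indices")
```

The two orderings agree only because no index reads another index's data; a function with cross-index dependencies would need a different decomposition.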
  10. FPGA platform: [Figure: ML310 board 1 and ML310 board 2, each holding four processing elements (PE1 to PE4) connected through FSL links and Aurora switches, with Aurora links between the boards.] Network Of FPGAs with Integrated Aurora Switches (NOFIS)
  11. PRO-PART design flow: [Figure: input specifications (CFG representation of the application algorithm + component specification of the implementation platform) feed the PRO-PART design flow, which targets the NOFIS platform. Flow steps: CFG captures the communication flow; component specification; partitioning and communication resource specification; automated: configure and generate values for communication cores; mapping to hardware.] • Automatic synthesis of communication cores depending on the process partitioning • Library of cores depending on the communication link • Determine max. flow control and configure the parameters
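The idea of choosing communication cores from the process partitioning can be sketched as follows. Everything in this example is hypothetical: the graph edges, the placement of processes on boards, the `select_cores` helper, and the simplifying rule that a link uses FSL when both ends share a board and Aurora when it crosses boards. It is not the PRO-PART tool itself.

```python
# Hypothetical communication flow graph: each link joins two processes.
cfg_links = [
    ("a1", "b1"), ("a2", "b2"),
    ("b1", "c1"), ("b2", "c1"),
    ("c1", "d1"), ("d1", "e1"),
]

# Hypothetical process partitioning onto two boards.
partition = {
    "a1": "board1", "a2": "board1", "b1": "board1", "b2": "board1",
    "c1": "board2", "d1": "board2", "e1": "board2",
}

def select_cores(links, placement):
    """Pick a core type per link (assumed rule: FSL when both ends
    share a board, Aurora when the link crosses boards)."""
    return {
        (src, dst): "FSL" if placement[src] == placement[dst] else "Aurora"
        for src, dst in links
    }

cores = select_cores(cfg_links, partition)
crossing = [link for link, core in cores.items() if core == "Aurora"]
print(cores)
print("links crossing boards:", crossing)
```

A different partition changes which links cross boards, which is why the core selection has to follow the partitioning step.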
  12. CUDA implementation results: [Bar chart: CUDA profile graph comparison. Horizontal axis: GPU time in usecs (log scale, 1 to 1000000). Vertical axis: CUDA kernel functions zone_create, real2complex, complex_mul, complex_conj, conj_symm, c2c_radix2_sp, find_corrmax, transpose, x_field, y_field, indx_reorder, meshgrid_x, meshgrid_y, fft_indx, cc_indx, memcopy. Series: CUDA 2.1 with memcopy, CUDA 2.2 with memcopy, CUDA 2.2 with zerocopy.]
  13. Results: [Bar chart: time in seconds by execution device (CPU, GPU, FPGA); axis from 0 to 1.1; bar labels 1.05, 0.65, and 0.34.] • Speedup scales linearly with the number of cores • GPU 3x speedup
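The quoted 3x figure can be checked against the bar labels on the Results slide. The assignment of bars to devices below is an assumption (1.05 s for the CPU, 0.65 s for the FPGA, 0.34 s for the GPU), chosen because it is the reading consistent with the stated GPU speedup and with the conclusion that the GPU is the fastest device.

```python
# Assumed mapping of the three bar labels to devices (time in seconds).
times = {"CPU": 1.05, "FPGA": 0.65, "GPU": 0.34}

# Speedup of each device relative to the CPU implementation.
speedup = {device: times["CPU"] / t for device, t in times.items()}

for device, s in speedup.items():
    print(f"{device}: {times[device]:.2f} s, speedup {s:.2f}x")
```

With this mapping the GPU comes out at roughly 3.1x and the FPGA at roughly 1.6x over the CPU.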
  14. Conclusions • Nvidia's Tesla C1060 GPU is faster than both the sequential CPU implementation and the NOFIS FPGA implementation for the PIV application • FPGAs with custom communication interfaces provide better throughput • The PRO-PART design flow is expected to reduce design development time (work in progress) • The ideal hybrid platform consists of a CPU and GPU for computation and an FPGA for data movement
  15. Discussion Questions
  16. Research Justification: [Figure: a streaming architecture with a large number of PEs (PE 1 to PE 36) requiring more than 1 board; ML310 boards 1 to 4 connected in a mesh over Aurora links, driven by the same clock value but different clock sources. Inset, inside an ML310: PE 1 to PE 9 connected through FSL links and Aurora switches, with one clock source.]
  17. Communication Flow Graph: [Figure: graph with processes a1 and a2 (N1), b1 and b2 (N2), c1 (N3), d1 (N4), and e1 (N5), connected by links l1 to l7.]
  18. PRO-PART design flow: [Figure: the communication flow graph (nodes N1 to N5, links l1 to l7) mapped onto ML310 board1 and ML310 board2. Inside an ML310, PEs connect through FSL links to I/O blocks, with inputs data_in1 and data_in2 and outputs data_out1 and data_out2.]
  19. Evaluating PRO-PART + NOFIS. Current implementation: • Time-consuming and extensive simulations • Hardware efficient due to hand-coded RTL [Flow: component design capture; partitioning and scheduling of processes; manual redesign loop; select parameters for the customizable cores; map the values on the hardware.] Proposed implementation (work in progress): • Shorter design cycle • Potential increase in resources but meets performance goals [Flow: CFG captures the communication flow; component specification; partitioning and communication resource specification; automated: configure and generate values for communication cores; mapping to hardware.]
  20. NOFIS Hardware: [Photo labels: ML310 board, Infiniband adapter, ML310.]