Accelerating Particle Image
 Velocimetry using Hybrid
       Architectures
        Vivek Venugopal
       Cameron Patterso...
Cardio-Vascular Disease (CVD)
 Cardio Vascular Disease (CVD)
 Cancer
 Other                                              5...
Research@AEThER Lab




                                              Intravascular
 Pressure distribution     Location of...
Research@AEThER Lab
    !                                        !




                                !                  ...
Motivation

                                        Case 1

                         Stent                       Each case...
FFT-based PIV algorithm
t               (16 x16) or
                (32 x 32) or
                                         ...
!! DE$F$G!,90<0)2!2(04)(!-(./01).1,()!HIFGJ=!
Implementation platform
            B@C!        ()*+,"1DEA"./0"
            ...
GPU platform
 GPU

                                                          Multiprocessor 30   Host (CPU)               ...
GPU platform
              CPU                   CUDA enabled program on GPU



                              kernel 10   ...
FPGA platform
                                              ML310 board 1                                                 ...
PRO-PART design flow
                               CFG representation
                                 of application
    ...
CUDA implementation results
                           zone_create
                         real2complex
                 ...
Results
                                                 1.1
                                                             ...
Conclusions

•   Nvidia's Tesla C1060 GPU faster than both the
    sequential CPU implementation and the NOFIS
    FPGA im...
Discussion
             Questions
Research Justification
                                                                                           Aurora
  ...
Communication Flow Graph

       a1        a2   N1                   N1


                           l1                   ...
PRO-PART design flow
                                                                                             Inside ML...
Evaluating PRO-PART + NOFIS
    Current implementation                            Proposed implementation
                ...
NOFIS Hardware




                 ML310 Infiniband
                  adapter board

      ML310
Upcoming SlideShare
Loading in …5
×

Accelerating Particle Image Velocimetry using Hybrid Architectures

854 views

Published on

High Performance Computing (HPC) applications are mapped to a cluster of multi-core processors communicating using high speed interconnects. More computational power is harnessed with the addition of hardware accelerators such as Graphics Processing Unit (GPU) cards and Field Programmable Gate Arrays (FPGAs). Particle Image Velocimetry (PIV) is an embarrassingly parallel application that can benefit from acceleration using hybrid architectures. The PIV application is mapped to a Nvidia GPU system, resulting in 3x speedup over a dual quad-core Intel processor implementation. The design methodology used to implement the PIV application on a specialized FPGA platform under development is described in brief and the resulting performance benefit is analyzed.

Published in: Education
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
854
On SlideShare
0
From Embeds
0
Number of Embeds
6
Actions
Shares
0
Downloads
4
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Accelerating Particle Image Velocimetry using Hybrid Architectures

  1. 1. Accelerating Particle Image Velocimetry using Hybrid Architectures Vivek Venugopal Cameron Patterson Kevin Shinpaugh
  2. 2. Cardio-Vascular Disease (CVD) Cardio Vascular Disease (CVD) Cancer Other 5% 5% Chronic Lung Respiratory Disease 6% Accidents 37% Alzheimerʼs Disease 24% 24% Cause of Deaths in United States for 2005 (reference: American Heart Association)
  3. 3. Research@AEThER Lab Intravascular Pressure distribution Location of stent within the heart Region of Interest • Research focuses on cardiovascular fluid dynamics and stent hemodynamics useful in designing intravascular stents for treating CVD.
  4. 4. Research@AEThER Lab ! ! ! ! ! ! ! PIV setup DPIV camera capturing the "#$%&'!()*+!,-.$ ! Laser illuminated fluid flow 4503!6'127+!89020!01!':2#&'!/'2%3!;'10&'!#:/2.66.2 >'//' 4503!&#$927+!?.2.!.@A%#/#2#0:!.:=!@0:2&06! • The AEThER lab uses the Particle Image Velocimetry 4C0220-!6'127+!!(!.D#/!-020&#<'=!2&.>'&/'!10&!. 10@%/#: 4C0220-!&#$927+!G2.:=.&=!?8,H!@.-'&.!#/!&'3 (PIV) technique to track the motion of particles in an ! illuminated flow field.
  5. 5. Motivation Case 1 Stent Each case = Case 2 configuration 1 1250 image pairs x 5 MB = 6.25 GB PIV Stent experimental configuration 2 setup Case 100 Stent configuration 20 • Time for FlowIQ analysis of one image pair on 2GHz Xeon processor = 16 minutes. • 1250 image pairs x 100 cases x 20 stent configurations = 2.6 years
  6. 6. FFT-based PIV algorithm t (16 x16) or (32 x 32) or Motion (64 x 64) Vector Image 1 FFT t + dt zone 1 Multiplication IFFT Reduction Image 2 FFT zone 2 • Peak detection using FFT blocks • Reduction block: sub-pixel algorithm and filtering to determine the velocity value
  7. 7. !! DE$F$G!,90<0)2!2(04)(!-(./01).1,()!HIFGJ=! Implementation platform B@C! ()*+,"1DEA"./0" K/)!DE$F$G!K)*3-!#LMN! 7"I!.5:',109C!+5-(2!0*!-9! -22B09!.-(2!+-*)2!59!1/)!K)*3-! #LMN!7"I=!$1!/-*!-!"#$! %&'()**!<,33B/)0C/1!<5(:!<-.15(! -92!0*!1-(C)1)2!-*!-!/0C/! ')(<5(:-9.)!.5:',109C! HO"#J!*53,1059!<5(!"#$! %&'()**!/5*1!;5(A*1-1059*=! 7"I!#5:',109C!+5-(2B3)4)3! '(52,.1*!25!951!/-4)!20*'3-?! .599).15(*!-92!-()!*').0<0.-33?! 2)*0C9)2!<5(!.5:',109C=! "(5.)**5(!.35.A*@!:):5(?! F;G5:)"BHC-"()*+,"1DEA"./0" • System G: 325 Apple Mac Pro nodes, 2 quad-core .59<0C,(-1059!-92!.5:',109C! <)-1,()*!;033!)4534)!20<<)()913?! 1/-9!C(-'/0.*!+5-(2!'(52,.1*=! Intel Xeon processors with 8 GB RAM per node K)*3-!#LMN!7"I!*').0<0.-1059*P! running Linux CentOS !! Q9)!7"I!HRSL!1/()-2!'(5.)**5(*J! ! !! R=T!78!2)20.-1)2!:):5(?! • Tesla C1060 U01*!09!59)!<,33B3)9C1/@!2,-3!*351!;01/!59)!5')9!"#$!%&'()**!&RV!*351! at 1.3 !! : 240 streaming cores clocked GHz with 4 GB onboard memory
  8. 8. GPU platform GPU Multiprocessor 30 Host (CPU) Device (GPU) Grid Multiprocessor 2 Multiprocessor 1 Block Block Block Shared Memory (0,0) (0,1) (0,2) kernel 1 Block Block Block Registers Registers Registers (1,0) (1,1) (1,2) Instruction Processor 1 Processor 2 Processor 8 Unit Block (1,1) Thread Thread Thread Thread Constant (0,0) (0,1) (0,2) (0,3) Cache Thread Thread Thread Thread (1,0) (1,1) (1,2) (1,3) Texture Thread Thread Thread Thread Cache Block (2,0) Block (2,1) (2,2) Block (2,3) (0,0) (0,1) (0,2) kernel 2 Block Block Block GPU memory (1,0) (1,1) (1,2) CUDA hardware CUDA programming model model for Tesla C1060 • Nvidia Tesla C1060: 30 streaming multi-processors with 8 cores each (SIMD) • Kernel-based execution
  9. 9. GPU platform CPU CUDA enabled program on GPU kernel 10 kernel 11 kernel 1n PIV sequential code cudaThreadSynchronize() in C consisting of 15 functions looped 3096 times kernel 20 kernel 21 kernel 2n cudaThreadSynchronize() kernel 30 kernel 31 kernel 3n cudaThreadSynchronize() kernel 150 kernel 151 kernel 15n cudaThreadSynchronize() index0 index1 indexn n = loop size = 3096
  10. 10. FPGA platform ML310 board 1 ML310 board 2 Aurora switches Aurora switches FSL FSL PE1 PE2 Aurora PE1 PE2 Aurora switches Aurora switches Aurora switches Aurora switches FSL FSL PE3 PE4 PE3 PE4 Aurora switches Aurora switches Network Of FPGAs with Integrated Aurora Switches (NOFIS)
  11. 11. PRO-PART design flow CFG representation of application algorithm PRO-PART + Design flow component specification of implementation platform NOFIS platform Input specifications • Automatic synthesis of CFG captures the communication component communication cores depending specification flow on the Process Partitioning partitioning and communication resource • Library of cores depending on the specification automated communication link configure and generate values for communication • Determine max. flow control and cores configure the parameters mapping to hardware
  12. 12. CUDA implementation results zone_create real2complex complex_mul complex_conj conj_symm CUDA kernel functions c2c_radix2_sp find_corrmax transpose x_field y_field indx_reorder CUDA 2.1 with memcopy meshgrid_x CUDA 2.2 with memcopy meshgrid_y CUDA 2.2 with zerocopy fft_indx cc_indx memcopy 1 10 100 1000 10000 100000 1000000 GPU Time in usecs CUDA Profile graph comparison
  13. 13. Results 1.1 CPU 1.05 GPU FPGA 0.825 • Linear speedup to the Time in seconds 0.65 number of cores 0.55 • GPU 3x speedup 0.34 0.275 0 Execution Device
  14. 14. Conclusions • Nvidia's Tesla C1060 GPU faster than both the sequential CPU implementation and the NOFIS FPGA implementation for the PIV application • FPGAs with custom communication interfaces provide better throughput • PRO-PART design flow expected to reduce design development time (work in progress) • Ideal Hybrid platform consisting of CPU and GPU for computation and FPGA for data movement
  15. 15. Discussion Questions
  16. 16. Research Justification Aurora PE PE PE ML310 ML310 1 2 6 board 1 board 2 PE PE PE Aurora 7 8 12 ML310 ML310 board 3 board 4 PE PE PE 31 32 36 ML310 boards connected in Clock source mesh driven by same clock value but different sources Streaming architecture with large number of PEs requiring more than 1 board Aurora switches FSL PE PE PE 1 2 3 Aurora switches Aurora switches FSL PE PE PE 4 5 6 PE PE PE 7 8 9 Clock source Aurora switches Inside a ML310
  17. 17. Communication Flow Graph a1 a2 N1 N1 l1 l2 b1 b2 N2 N2 l3 l4 c1 N3 l5 d1 N4 l6 e1 N5 l7
  18. 18. PRO-PART design flow Inside ML310 N1 N1 l1 l2 PE PE N2 N2 data_in1 I/O FSL I/O data_out1 ML310 board1 l3 l4 ML310 board2 FSL FSL N3 data_in2 data_out2 I/O FSL I/O l5 PE PE N4 l6 N5 l7
  19. 19. Evaluating PRO-PART + NOFIS Current implementation Proposed implementation (work in progress) • Time-consuming and • Shorter design cycle extensive simulations • Potential increase in resource • Hardware efficient due to but meets performance goals hand-coded RTL CFG captures the component design capture communication specification flow partitioning and partitioning and scheduling of processes communication resource specification manual redesign loop automated select parameters for configure and generate the customizable cores values for communication cores map the values on the hardware mapping to hardware
  20. 20. NOFIS Hardware ML310 Infiniband adapter board ML310

×