Accelerating PIV using hybrid architectures


Published on

Poster for 2009 Symposium on Application Accelerators in High-Performance Computing (SAAHPC'09)

  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Accelerating PIV using hybrid architectures

  1. 1. Accelerating Particle Image Velocimetry using Hybrid Architectures Vivek Venugopal, Cameron D. Patterson, Kevin Shinpaugh,, Introduction t (16 x16) or (32 x 32) or FFT-based PIV algorithm Cardio Vascular Disease (CVD) 5% 4%3% (64 x 64) Motion • Each image is broken down into overlapping Cancer 6% 35% zones and the corresponding zones from both Other Vector Chronic Lung Respiratory Disease Image 1 images are correlated. Accidents 24% t + dt FFT • The peak detection is done using the FFT Alzheimer’s Disease zone 1 block, which is then translated to depict the Diabetes 23% Multiplication IFFT Reduction direction of the particles within the zone. • The reduction block consists of a sub-pixel Cause of Deaths in United States for 2005 FFT algorithm and a filtering routine to compute Image 2 • The Advanced Experimental Thermofluids Engineering zone 2 the velocity value. (AEThER) Lab at Virginia Tech is involved in the area of cardiovascular fluid dynamics and stent hemodynamics, which is instrumental for designing intravascular stents for treating CVD. 325 Apple Mac Pro nodes 2 quad-core Intel Xeon processors with Implementation platforms and Results zone_create • The AEThER lab uses the Particle Image Velocimetry 8 GB RAM per node running Linux real2complex 1.25 (PIV) technique to track the motion of particles in an CentOS complex_mul complex_conj CPU GPU System G platform illuminated flow field. FPGA CUDA kernel functions conj_symm 1.00 1.05 GPU c2c_radix2_sp Host (CPU) Device (GPU) find_corrmax Time in seconds Multiprocessor 30 Multiprocessor 2 Grid transpose 0.75 Multiprocessor 1 Block (0,0) Block (0,1) Block (0,2) x_field 0.65 Shared Memory kernel 1 Block Block Block y_field Registers Registers Registers (1,0) (1,1) (1,2) indx_reorder 0.50 Instruction CUDA 2.1 with memcopy Processor 1 Processor 2 Processor 8 Unit Block (1,1) meshgrid_x CUDA 2.2 with memcopy Thread Thread Thread Thread meshgrid_y CUDA 2.2 with zerocopy 0.34 0.25 Constant (0,0) (0,1) (0,2) (0,3) Cache Thread (1,0) Thread (1,1) Thread (1,2) Thread (1,3) fft_indx Texture Cache Thread Block (2,0) Thread Thread Block (2,1) (2,2) Thread Block (2,3) cc_indx (0,0) (0,1) (0,2) memcopy kernel 2 0 Location of Pressure distribution GPU memory Block (1,0) Block (1,1) Block (1,2) 1.000 26.591 707.107 18803.015 500000.000 Execution Device Region of Interest within the heart GPU time in usecs CUDA hardware model for Tesla C1060 CUDA programming model CUDA profile graph comparison Execution time comparison SFG representation of application Case 1 algorithm PRO-PART + Conclusion Design flow ML310 board 1 ML310 board 2 component specification of Stent Each case = Aurora switches Aurora switches implementation platform Case 2 NOFIS platform configuration 1 1250 image pairs FSL FSL Input specifications Aurora • Nvidia's Tesla C1060 GPU provides computational speedup as compared to both PE1 PE2 PE1 PE2 x 5 MB = 6.25 GB Aurora switches Aurora switches Aurora switches Aurora switches PIV FSL FSL Stent experimental configuration 2 the sequential CPU implementation and the NOFIS implementation for the PIV PE3 PE4 PE3 PE4 SAFC dataflow structure and setup capture using SFG components specification Case 100 application. Aurora switches Aurora switches • The synchronization and the latency in data movement can be optimized between Aurora Aurora ML310 board 4 ML310 board 3 partitioning and Stent Aurora switches Aurora switches communication resource specification the FPGAs by having custom communication interfaces without an Operating configuration 20 FSL FSL automated PE1 PE2 Aurora PE1 PE2 Aurora switches Aurora switches Aurora switches Aurora switches System (OS) overhead. FSL FSL configure and generate Time for FlowIQ analysis of one image pair on 2GHz values for communication cores • The PRO-PART design flow facilitates a fast and easier mapping of a streaming PE3 PE4 PE3 PE4 Xeon processor = 16 minutes. So time required for Aurora switches Aurora switches complete dataset, 1250 image pairs x 100 cases x 20 mapping to hardware application on the specialized NOFIS platform with a significant advantage in stent configurations = 2.6 years Multi-Core SAFC hardware: NOFIS platform PRO-PART design flow development time. References [1] American Heart Association. (Last Accessed: February 2009) Cardiovascular Disease Statistics. [Online]. Available: [2] A. Eckstein, J. Charonko, and P.Vlachos. Phase Correlation Processing for DPIV Measurements: Part 1 Spatial Domain Analysis. In Proceedings of FEDSM2007,5th Joint ASME/JSME Fluids Engineering Conference, number FEDSM2007-37286, San Diego, California, August 2007. ASME. [3] Nvidia Inc. (Last Accessed: February 2009) Nvidia Tesla C1060 GPU Computing Processor. [Online]. Available: