Accelerating PIV using hybrid architectures

Vivek Venugopalan
Vivek VenugopalanResearch Scientist at United Technologies Research Center
Accelerating Particle Image Velocimetry using Hybrid Architectures
                                                                                         Vivek Venugopal, Cameron D. Patterson, Kevin Shinpaugh
                                                                                                                                                                                                                        vivekv@vt.edu, cdp@vt.edu, kashin@vt.edu


 Introduction                                                                            t                                                                                 (16 x16) or
                                                                                                                                                                           (32 x 32) or
                                                                                                                                                                                                                                                                                                                                                                           FFT-based PIV algorithm
      Cardio Vascular Disease (CVD)                        5%
                                                             4%3%                                                                                                          (64 x 64)
                                                                                                                                                                                                                                                                                                                                                                                                  Motion
                                                                                                                                                                                                                                                                                                                                                                                                                               • Each image is broken down into overlapping
      Cancer                                             6%                 35%                                                                                                                                                                                                                                                                                                                                                zones and the corresponding zones from both
      Other                                                                                                                                                                                                                                                                                                                                                                                       Vector
      Chronic Lung Respiratory Disease                                                       Image 1                                                                                                                                                                                                                                                                                                                           images are correlated.
      Accidents                                       24%                                t + dt
                                                                                                                                                                                                                                                               FFT
                                                                                                                                                                                                                                                                                                                                                                                                                               • The peak detection is done using the FFT
      Alzheimer’s Disease                                                                                                                                                            zone 1                                                                                                                                                                                                                                    block, which is then translated to depict the
      Diabetes                                                      23%                                                                                                                                                                                                                                                 Multiplication                                                 IFFT                Reduction           direction of the particles within the zone.
                                                                                                                                                                                                                                                                                                                                                                                                                               • The reduction block consists of a sub-pixel
                 Cause of Deaths in United States for 2005                                                                                                                                                                                                     FFT                                                                                                                                                             algorithm and a filtering routine to compute
                                                                                             Image 2
 • The Advanced Experimental Thermofluids Engineering                                                                                                                                 zone 2
                                                                                                                                                                                                                                                                                                                                                                                                                               the velocity value.
 (AEThER) Lab at Virginia Tech is involved in the area of
 cardiovascular fluid dynamics and stent hemodynamics,
 which is instrumental for designing intravascular stents
 for treating CVD.
                                                                                                                                                                                325 Apple Mac Pro nodes
                                                                                                                                                                                2 quad-core Intel Xeon processors with
                                                                                                                                                                                                                                                                                                                                                        Implementation platforms and Results
                                                                                                                                                                                                                                                                                                                                                                              zone_create
 • The AEThER lab uses the Particle Image Velocimetry                                                                                                                           8 GB RAM per node running Linux                                                                                                                                                             real2complex
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   1.25
 (PIV) technique to track the motion of particles in an                                                                                                                         CentOS                                                                                                                                                                                      complex_mul
                                                                                                                                                                                                                                                                                                                                                                           complex_conj
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  CPU
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  GPU
                                                                                                                                                                                System G platform
 illuminated flow field.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            FPGA




                                                                                                                                                                                                                                                                                                                                                   CUDA kernel functions
                                                                                                                                                                                                                                                                                                                                                                               conj_symm                                                                                           1.00   1.05
                                                                                                                 GPU                                                                                                                                                                                                                                                       c2c_radix2_sp
                                                                                                                                                                                                                                                                 Host (CPU)                                   Device (GPU)
                                                                                                                                                                                                                                                                                                                                                                            find_corrmax




                                                                                                                                                                                                                                                                                                                                                                                                                                                                 Time in seconds
                                                                                                                                                                                                                    Multiprocessor 30



                                                                                                                                                                                                               Multiprocessor 2
                                                                                                                                                                                                                                                                                                  Grid
                                                                                                                                                                                                                                                                                                                                                                                transpose                                                                                          0.75
                                                                                                                                                                                                           Multiprocessor 1                                                                         Block
                                                                                                                                                                                                                                                                                                    (0,0)
                                                                                                                                                                                                                                                                                                                   Block
                                                                                                                                                                                                                                                                                                                   (0,1)
                                                                                                                                                                                                                                                                                                                                 Block
                                                                                                                                                                                                                                                                                                                                 (0,2)                                              x_field
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        0.65
                                                                                                                                            Shared Memory
                                                                                                                                                                                                                                                                   kernel 1
                                                                                                                                                                                                                                                                                                    Block          Block         Block                                              y_field
                                                                                                                      Registers             Registers                          Registers                                                                                                            (1,0)          (1,1)         (1,2)
                                                                                                                                                                                                                                                                                                                                                                            indx_reorder                                                                                           0.50
                                                                                                                                                                                                               Instruction                                                                                                                                                                                                     CUDA 2.1 with memcopy
                                                                                                                      Processor 1           Processor 2                        Processor 8                        Unit
                                                                                                                                                                                                                                                                                                               Block (1,1)                                                     meshgrid_x                                      CUDA 2.2 with memcopy
                                                                                                                                                                                                                                                                                                     Thread    Thread   Thread   Thread
                                                                                                                                                                                                                                                                                                                                                                               meshgrid_y                                      CUDA 2.2 with zerocopy                                            0.34
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   0.25
                                                                                                                                                                                                                  Constant                                                                            (0,0)     (0,1)    (0,2)    (0,3)
                                                                                                                                                                                                                    Cache
                                                                                                                                                                                                                                                                                                     Thread
                                                                                                                                                                                                                                                                                                      (1,0)
                                                                                                                                                                                                                                                                                                               Thread
                                                                                                                                                                                                                                                                                                                (1,1)
                                                                                                                                                                                                                                                                                                                        Thread
                                                                                                                                                                                                                                                                                                                         (1,2)
                                                                                                                                                                                                                                                                                                                                 Thread
                                                                                                                                                                                                                                                                                                                                  (1,3)
                                                                                                                                                                                                                                                                                                                                                                                  fft_indx
                                                                                                                                                                                                                      Texture
                                                                                                                                                                                                                       Cache
                                                                                                                                                                                                                                                                                                     Thread
                                                                                                                                                                                                                                                                                                    Block
                                                                                                                                                                                                                                                                                                      (2,0)
                                                                                                                                                                                                                                                                                                               Thread Thread
                                                                                                                                                                                                                                                                                                                    Block
                                                                                                                                                                                                                                                                                                                (2,1)   (2,2)
                                                                                                                                                                                                                                                                                                                                 Thread
                                                                                                                                                                                                                                                                                                                                  Block
                                                                                                                                                                                                                                                                                                                                  (2,3)
                                                                                                                                                                                                                                                                                                                                                                                  cc_indx
                                                                                                                                                                                                                                                                                                    (0,0)          (0,1)          (0,2)
                                                                                                                                                                                                                                                                                                                                                                                memcopy
                                                                                                                                                                                                                                                                   kernel 2
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     0
             Location of                    Pressure distribution                                                                                                         GPU memory
                                                                                                                                                                                                                                                                                                    Block
                                                                                                                                                                                                                                                                                                    (1,0)
                                                                                                                                                                                                                                                                                                                   Block
                                                                                                                                                                                                                                                                                                                   (1,1)
                                                                                                                                                                                                                                                                                                                                 Block
                                                                                                                                                                                                                                                                                                                                 (1,2)                                                    1.000      26.591          707.107         18803.015      500000.000                            Execution Device
          Region of Interest                  within the heart
                                                                                                                                                                                                                                                                                                                                                                                                               GPU time in usecs
                                                                                              CUDA hardware model for Tesla C1060                                                                                                                                         CUDA programming model
                                                                                                                                                                                                                                                                                                                                                                                                           CUDA profile graph comparison                                               Execution time comparison
                                                                                                                                                                                                                                                                 SFG representation
                                                                                                                                                                                                                                                                   of application
                                                     Case 1                                                                                                                                                                                                          algorithm                    PRO-PART
                                                                                                                                                                                                                                                                         +


                                                                                                                                                                                                                                                                                                                                                                           Conclusion
                                                                                                                                                                                                                                                                                                  Design flow
                                                                                                                                              ML310 board 1                                                                           ML310 board 2
                                                                                                                                                                                                                                                                     component
                                                                                                                                                                                                                                                                   specification of
                              Stent                                      Each case =                                      Aurora switches                                                                         Aurora switches
                                                                                                                                                                                                                                                               implementation platform
                                                     Case 2                                                                                                                                                                                                                                                                       NOFIS platform
                          configuration 1                              1250 image pairs
                                                                                                                              FSL                                                                                       FSL
                                                                                                                                                                                                                                                                 Input specifications
                                                                                                                                                                            Aurora




                                                                                                                                                                                                                                                                                                                                                             • Nvidia's Tesla C1060 GPU provides computational speedup as compared to both
                                                                                                                        PE1            PE2                                                                      PE1             PE2
                                                                      x 5 MB = 6.25 GB
                                                                                                                                                        Aurora switches




                                                                                                                                                                                                                                             Aurora switches
                                                                                              Aurora switches




                                                                                                                                                                                       Aurora switches




      PIV                                                                                                       FSL                                                                                      FSL

                              Stent
  experimental
                          configuration 2
                                                                                                                                                                                                                                                                                                                                                             the sequential CPU implementation and the NOFIS implementation for the PIV
                                                                                                                        PE3            PE4                                                                      PE3             PE4                                                      SAFC dataflow          structure and
     setup                                                                                                                                                                                                                                                                               capture using
                                                                                                                                                                                                                                                                                             SFG
                                                                                                                                                                                                                                                                                                               components
                                                                                                                                                                                                                                                                                                               specification
                                                    Case 100
                                                                                                                                                                                                                                                                                                                                                             application.
                                                                                                                          Aurora switches                                                                         Aurora switches




                                                                                                                                                                                                                                                                                                                                                             • The synchronization and the latency in data movement can be optimized between
                                                                                                                                            Aurora                                                         Aurora
                                                                                                                                              ML310 board 4                                                                           ML310 board 3                                              partitioning and
                               Stent                                                                                      Aurora switches                                                                         Aurora switches
                                                                                                                                                                                                                                                                                              communication resource
                                                                                                                                                                                                                                                                                                  specification


                                                                                                                                                                                                                                                                                                                                                             the FPGAs by having custom communication interfaces without an Operating
                          configuration 20                                                                                     FSL                                                                                      FSL
                                                                                                                                                                                                                                                                                                                  automated
                                                                                                                        PE1            PE2                                  Aurora                              PE1             PE2
                                                                                                                                                        Aurora switches




                                                                                                                                                                                                                                             Aurora switches
                                                                                              Aurora switches




                                                                                                                                                                                       Aurora switches




                                                                                                                                                                                                                                                                                                                                                             System (OS) overhead.
                                                                                                                FSL                                                                                      FSL                                                                                  configure and generate


 Time for FlowIQ analysis of one image pair on 2GHz                                                                                                                                                                                                                                          values for communication
                                                                                                                                                                                                                                                                                                       cores



                                                                                                                                                                                                                                                                                                                                                             • The PRO-PART design flow facilitates a fast and easier mapping of a streaming
                                                                                                                        PE3            PE4                                                                      PE3             PE4




 Xeon processor = 16 minutes. So time required for                                                                        Aurora switches                                                                         Aurora switches




 complete dataset, 1250 image pairs x 100 cases x 20
                                                                                                                                                                                                                                                                                               mapping to hardware
                                                                                                                                                                                                                                                                                                                                                             application on the specialized NOFIS platform with a significant advantage in
 stent configurations = 2.6 years                                                             Multi-Core SAFC hardware: NOFIS platform                                                                                                                                            PRO-PART design flow                                                         development time.


References
[1] American Heart Association. (Last Accessed: February 2009) Cardiovascular Disease Statistics. [Online]. Available: http://www.americanheart.org/presenter.jhtml?identifier=4478
[2] A. Eckstein, J. Charonko, and P.Vlachos. Phase Correlation Processing for DPIV Measurements: Part 1 Spatial Domain Analysis. In Proceedings of FEDSM2007,5th Joint ASME/JSME Fluids Engineering
Conference, number FEDSM2007-37286, San Diego, California, August 2007. ASME.
[3] Nvidia Inc. (Last Accessed: February 2009) Nvidia Tesla C1060 GPU Computing Processor. [Online]. Available: http://www.nvidia.com/object/product_tesla_c1060_us.html
1 of 1

Recommended

Russia by
RussiaRussia
RussiaEugene Dolomanji
901 views28 slides
Bela Fazenda 02 by
Bela Fazenda 02Bela Fazenda 02
Bela Fazenda 02delsar
214 views4 slides
Earth This Beautiful World by
Earth This Beautiful WorldEarth This Beautiful World
Earth This Beautiful WorldDINISHA
639 views23 slides
A beautiful world by
A beautiful worldA beautiful world
A beautiful worldBella Meraki
665 views19 slides
Beautiful Flower Collection by
Beautiful Flower CollectionBeautiful Flower Collection
Beautiful Flower CollectionThilini
1.1K views28 slides
03. исследование часть hr бренд в уа от hh.ua и prp by
03. исследование часть hr бренд в уа от hh.ua и prp03. исследование часть hr бренд в уа от hh.ua и prp
03. исследование часть hr бренд в уа от hh.ua и prpRTC
461 views10 slides

More Related Content

Viewers also liked

Beautiful world places by
Beautiful world placesBeautiful world places
Beautiful world placesStelian Ciocarlie
276 views39 slides
30 Most Beautiful Waterfalls in India by
30 Most Beautiful Waterfalls in India30 Most Beautiful Waterfalls in India
30 Most Beautiful Waterfalls in IndiaShine Gold Tours India
24.3K views32 slides
Beautiful World of Snow by
Beautiful World of SnowBeautiful World of Snow
Beautiful World of SnowRen
4.1K views58 slides
Quelle beaute d jk1 by
Quelle beaute d jk1Quelle beaute d jk1
Quelle beaute d jk1Renée Bukay
500 views42 slides
Hardware Trojans By - Anupam Tiwari by
Hardware Trojans By - Anupam TiwariHardware Trojans By - Anupam Tiwari
Hardware Trojans By - Anupam TiwariOWASP Delhi
1.2K views83 slides
Waterfalls by
WaterfallsWaterfalls
WaterfallsMarco Belzoni
1.2K views60 slides

Viewers also liked(8)

Beautiful World of Snow by Ren
Beautiful World of SnowBeautiful World of Snow
Beautiful World of Snow
Ren 4.1K views
Hardware Trojans By - Anupam Tiwari by OWASP Delhi
Hardware Trojans By - Anupam TiwariHardware Trojans By - Anupam Tiwari
Hardware Trojans By - Anupam Tiwari
OWASP Delhi1.2K views
The Most Amazing Waterfalls by Joanne Creasy
The Most Amazing Waterfalls The Most Amazing Waterfalls
The Most Amazing Waterfalls
Joanne Creasy1.3K views
Worlds scariest bridge 2014 by Eugene Koh
Worlds scariest bridge 2014Worlds scariest bridge 2014
Worlds scariest bridge 2014
Eugene Koh525 views

More from Vivek Venugopalan

xDEFENSE: An Extended DEFENSE for mitigating Next Generation Intrusions by
xDEFENSE: An Extended DEFENSE for mitigating Next Generation IntrusionsxDEFENSE: An Extended DEFENSE for mitigating Next Generation Intrusions
xDEFENSE: An Extended DEFENSE for mitigating Next Generation IntrusionsVivek Venugopalan
186 views1 slide
Design, Implementation and Security Analysis of Hardware Trojan Threats in FPGA by
Design, Implementation and Security Analysis of Hardware Trojan Threats in FPGADesign, Implementation and Security Analysis of Hardware Trojan Threats in FPGA
Design, Implementation and Security Analysis of Hardware Trojan Threats in FPGAVivek Venugopalan
301 views1 slide
Accelerating Real-Time LiDAR Data Processing Using GPUs by
Accelerating Real-Time LiDAR Data Processing Using GPUsAccelerating Real-Time LiDAR Data Processing Using GPUs
Accelerating Real-Time LiDAR Data Processing Using GPUsVivek Venugopalan
1.1K views12 slides
Hardware acceleration of TEA and XTEA algorithms on FPGA, GPU and multi-core ... by
Hardware acceleration of TEA and XTEA algorithms on FPGA, GPU and multi-core ...Hardware acceleration of TEA and XTEA algorithms on FPGA, GPU and multi-core ...
Hardware acceleration of TEA and XTEA algorithms on FPGA, GPU and multi-core ...Vivek Venugopalan
937 views1 slide
Real-time processing for ATST by
Real-time processing for ATSTReal-time processing for ATST
Real-time processing for ATSTVivek Venugopalan
763 views28 slides
Accelerating Particle Image Velocimetry using Hybrid Architectures by
Accelerating Particle Image Velocimetry using Hybrid ArchitecturesAccelerating Particle Image Velocimetry using Hybrid Architectures
Accelerating Particle Image Velocimetry using Hybrid ArchitecturesVivek Venugopalan
600 views20 slides

More from Vivek Venugopalan(7)

xDEFENSE: An Extended DEFENSE for mitigating Next Generation Intrusions by Vivek Venugopalan
xDEFENSE: An Extended DEFENSE for mitigating Next Generation IntrusionsxDEFENSE: An Extended DEFENSE for mitigating Next Generation Intrusions
xDEFENSE: An Extended DEFENSE for mitigating Next Generation Intrusions
Vivek Venugopalan186 views
Design, Implementation and Security Analysis of Hardware Trojan Threats in FPGA by Vivek Venugopalan
Design, Implementation and Security Analysis of Hardware Trojan Threats in FPGADesign, Implementation and Security Analysis of Hardware Trojan Threats in FPGA
Design, Implementation and Security Analysis of Hardware Trojan Threats in FPGA
Vivek Venugopalan301 views
Accelerating Real-Time LiDAR Data Processing Using GPUs by Vivek Venugopalan
Accelerating Real-Time LiDAR Data Processing Using GPUsAccelerating Real-Time LiDAR Data Processing Using GPUs
Accelerating Real-Time LiDAR Data Processing Using GPUs
Vivek Venugopalan1.1K views
Hardware acceleration of TEA and XTEA algorithms on FPGA, GPU and multi-core ... by Vivek Venugopalan
Hardware acceleration of TEA and XTEA algorithms on FPGA, GPU and multi-core ...Hardware acceleration of TEA and XTEA algorithms on FPGA, GPU and multi-core ...
Hardware acceleration of TEA and XTEA algorithms on FPGA, GPU and multi-core ...
Vivek Venugopalan937 views
Accelerating Particle Image Velocimetry using Hybrid Architectures by Vivek Venugopalan
Accelerating Particle Image Velocimetry using Hybrid ArchitecturesAccelerating Particle Image Velocimetry using Hybrid Architectures
Accelerating Particle Image Velocimetry using Hybrid Architectures
Vivek Venugopalan600 views

Recently uploaded

Jibachha publishing Textbook.docx by
Jibachha publishing Textbook.docxJibachha publishing Textbook.docx
Jibachha publishing Textbook.docxDrJibachhaSahVetphys
47 views14 slides
On Killing a Tree.pptx by
On Killing a Tree.pptxOn Killing a Tree.pptx
On Killing a Tree.pptxAncyTEnglish
66 views11 slides
Dance KS5 Breakdown by
Dance KS5 BreakdownDance KS5 Breakdown
Dance KS5 BreakdownWestHatch
86 views2 slides
Java Simplified: Understanding Programming Basics by
Java Simplified: Understanding Programming BasicsJava Simplified: Understanding Programming Basics
Java Simplified: Understanding Programming BasicsAkshaj Vadakkath Joshy
316 views155 slides
ACTIVITY BOOK key water sports.pptx by
ACTIVITY BOOK key water sports.pptxACTIVITY BOOK key water sports.pptx
ACTIVITY BOOK key water sports.pptxMar Caston Palacio
745 views4 slides
Ch. 8 Political Party and Party System.pptx by
Ch. 8 Political Party and Party System.pptxCh. 8 Political Party and Party System.pptx
Ch. 8 Political Party and Party System.pptxRommel Regala
53 views11 slides

Recently uploaded(20)

Dance KS5 Breakdown by WestHatch
Dance KS5 BreakdownDance KS5 Breakdown
Dance KS5 Breakdown
WestHatch86 views
Ch. 8 Political Party and Party System.pptx by Rommel Regala
Ch. 8 Political Party and Party System.pptxCh. 8 Political Party and Party System.pptx
Ch. 8 Political Party and Party System.pptx
Rommel Regala53 views
AUDIENCE - BANDURA.pptx by iammrhaywood
AUDIENCE - BANDURA.pptxAUDIENCE - BANDURA.pptx
AUDIENCE - BANDURA.pptx
iammrhaywood89 views
Pharmaceutical Inorganic Chemistry Unit IVMiscellaneous compounds Expectorant... by Ms. Pooja Bhandare
Pharmaceutical Inorganic Chemistry Unit IVMiscellaneous compounds Expectorant...Pharmaceutical Inorganic Chemistry Unit IVMiscellaneous compounds Expectorant...
Pharmaceutical Inorganic Chemistry Unit IVMiscellaneous compounds Expectorant...
Ms. Pooja Bhandare109 views
Use of Probiotics in Aquaculture.pptx by AKSHAY MANDAL
Use of Probiotics in Aquaculture.pptxUse of Probiotics in Aquaculture.pptx
Use of Probiotics in Aquaculture.pptx
AKSHAY MANDAL104 views
Ch. 7 Political Participation and Elections.pptx by Rommel Regala
Ch. 7 Political Participation and Elections.pptxCh. 7 Political Participation and Elections.pptx
Ch. 7 Political Participation and Elections.pptx
Rommel Regala105 views
Education and Diversity.pptx by DrHafizKosar
Education and Diversity.pptxEducation and Diversity.pptx
Education and Diversity.pptx
DrHafizKosar177 views
How to empty an One2many field in Odoo by Celine George
How to empty an One2many field in OdooHow to empty an One2many field in Odoo
How to empty an One2many field in Odoo
Celine George72 views
11.30.23 Poverty and Inequality in America.pptx by mary850239
11.30.23 Poverty and Inequality in America.pptx11.30.23 Poverty and Inequality in America.pptx
11.30.23 Poverty and Inequality in America.pptx
mary850239167 views

Accelerating PIV using hybrid architectures

  • 1. Accelerating Particle Image Velocimetry using Hybrid Architectures Vivek Venugopal, Cameron D. Patterson, Kevin Shinpaugh vivekv@vt.edu, cdp@vt.edu, kashin@vt.edu Introduction t (16 x16) or (32 x 32) or FFT-based PIV algorithm Cardio Vascular Disease (CVD) 5% 4%3% (64 x 64) Motion • Each image is broken down into overlapping Cancer 6% 35% zones and the corresponding zones from both Other Vector Chronic Lung Respiratory Disease Image 1 images are correlated. Accidents 24% t + dt FFT • The peak detection is done using the FFT Alzheimer’s Disease zone 1 block, which is then translated to depict the Diabetes 23% Multiplication IFFT Reduction direction of the particles within the zone. • The reduction block consists of a sub-pixel Cause of Deaths in United States for 2005 FFT algorithm and a filtering routine to compute Image 2 • The Advanced Experimental Thermofluids Engineering zone 2 the velocity value. (AEThER) Lab at Virginia Tech is involved in the area of cardiovascular fluid dynamics and stent hemodynamics, which is instrumental for designing intravascular stents for treating CVD. 325 Apple Mac Pro nodes 2 quad-core Intel Xeon processors with Implementation platforms and Results zone_create • The AEThER lab uses the Particle Image Velocimetry 8 GB RAM per node running Linux real2complex 1.25 (PIV) technique to track the motion of particles in an CentOS complex_mul complex_conj CPU GPU System G platform illuminated flow field. FPGA CUDA kernel functions conj_symm 1.00 1.05 GPU c2c_radix2_sp Host (CPU) Device (GPU) find_corrmax Time in seconds Multiprocessor 30 Multiprocessor 2 Grid transpose 0.75 Multiprocessor 1 Block (0,0) Block (0,1) Block (0,2) x_field 0.65 Shared Memory kernel 1 Block Block Block y_field Registers Registers Registers (1,0) (1,1) (1,2) indx_reorder 0.50 Instruction CUDA 2.1 with memcopy Processor 1 Processor 2 Processor 8 Unit Block (1,1) meshgrid_x CUDA 2.2 with memcopy Thread Thread Thread Thread meshgrid_y CUDA 2.2 with zerocopy 0.34 0.25 Constant (0,0) (0,1) (0,2) (0,3) Cache Thread (1,0) Thread (1,1) Thread (1,2) Thread (1,3) fft_indx Texture Cache Thread Block (2,0) Thread Thread Block (2,1) (2,2) Thread Block (2,3) cc_indx (0,0) (0,1) (0,2) memcopy kernel 2 0 Location of Pressure distribution GPU memory Block (1,0) Block (1,1) Block (1,2) 1.000 26.591 707.107 18803.015 500000.000 Execution Device Region of Interest within the heart GPU time in usecs CUDA hardware model for Tesla C1060 CUDA programming model CUDA profile graph comparison Execution time comparison SFG representation of application Case 1 algorithm PRO-PART + Conclusion Design flow ML310 board 1 ML310 board 2 component specification of Stent Each case = Aurora switches Aurora switches implementation platform Case 2 NOFIS platform configuration 1 1250 image pairs FSL FSL Input specifications Aurora • Nvidia's Tesla C1060 GPU provides computational speedup as compared to both PE1 PE2 PE1 PE2 x 5 MB = 6.25 GB Aurora switches Aurora switches Aurora switches Aurora switches PIV FSL FSL Stent experimental configuration 2 the sequential CPU implementation and the NOFIS implementation for the PIV PE3 PE4 PE3 PE4 SAFC dataflow structure and setup capture using SFG components specification Case 100 application. Aurora switches Aurora switches • The synchronization and the latency in data movement can be optimized between Aurora Aurora ML310 board 4 ML310 board 3 partitioning and Stent Aurora switches Aurora switches communication resource specification the FPGAs by having custom communication interfaces without an Operating configuration 20 FSL FSL automated PE1 PE2 Aurora PE1 PE2 Aurora switches Aurora switches Aurora switches Aurora switches System (OS) overhead. FSL FSL configure and generate Time for FlowIQ analysis of one image pair on 2GHz values for communication cores • The PRO-PART design flow facilitates a fast and easier mapping of a streaming PE3 PE4 PE3 PE4 Xeon processor = 16 minutes. So time required for Aurora switches Aurora switches complete dataset, 1250 image pairs x 100 cases x 20 mapping to hardware application on the specialized NOFIS platform with a significant advantage in stent configurations = 2.6 years Multi-Core SAFC hardware: NOFIS platform PRO-PART design flow development time. References [1] American Heart Association. (Last Accessed: February 2009) Cardiovascular Disease Statistics. [Online]. Available: http://www.americanheart.org/presenter.jhtml?identifier=4478 [2] A. Eckstein, J. Charonko, and P.Vlachos. Phase Correlation Processing for DPIV Measurements: Part 1 Spatial Domain Analysis. In Proceedings of FEDSM2007,5th Joint ASME/JSME Fluids Engineering Conference, number FEDSM2007-37286, San Diego, California, August 2007. ASME. [3] Nvidia Inc. (Last Accessed: February 2009) Nvidia Tesla C1060 GPU Computing Processor. [Online]. Available: http://www.nvidia.com/object/product_tesla_c1060_us.html