GPU acceleration of image processing
Jan Lemeire
15/11/2012
GPU vs CPU Peak Performance Trends

• GPU peak performance has grown aggressively.
• Hardware has kept up with Moore's law.

1995: 5,000 triangles/second, 800,000-transistor GPU
2010: 350 million triangles/second, 3-billion-transistor GPU

Source: NVIDIA
To the rescue: Graphical Processing Units (GPUs)

• Many-core GPU versus multi-core CPU
• 94 fps (AMD Tahiti Pro)
• GPU: 1-3 TeraFlop/second instead of 10-20 GigaFlop/second for a CPU

Figure 1.1: The enlarging performance gap between GPUs and CPUs. Courtesy: John Owens
GPUs are an alternative to CPUs in offering processing power.
Image-processing pipeline: pixel rescaling, lens correction, pattern detection

• The CPU gives only 4 fps.
• Next-generation machines need 50 fps.
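To make this kind of work concrete, here is a minimal OpenCL C sketch of the first pipeline stage. It assumes a single-channel 8-bit image and nearest-neighbour interpolation; these are illustrative choices, not the inspection machine's actual algorithm.

// Illustrative kernel: nearest-neighbour rescaling of an 8-bit grayscale image.
// One work item computes one output pixel.
__kernel void rescale_nearest(__global const uchar* src, int srcW, int srcH,
                              __global uchar* dst, int dstW, int dstH)
{
    int x = get_global_id(0);              // output column
    int y = get_global_id(1);              // output row
    if (x >= dstW || y >= dstH) return;    // guard against a padded global size

    int sx = x * srcW / dstW;              // nearest source column
    int sy = y * srcH / dstH;              // nearest source row
    dst[y * dstW + x] = src[sy * srcW + sx];
}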
Result: CPU: 4 fps, GPU: 70 fps
Methodology

Application
→ Identification of compute-intensive parts
→ Feasibility study of GPU acceleration
→ GPU implementation
→ GPU optimization
→ Hardware
Obstacle 1: Hard(er) to implement
GPU Programming Concepts (OpenCL terminology)

Hardware model (Device/GPU, ~1 TFLOPS):
• Multiprocessors, each containing scalar processors (~1 GHz) with private memory (16K/8)
• Local memory per multiprocessor: 16/48 KB, ~40 GB/s, a few cycles of latency
• Global memory: 1 GB, ~100 GB/s, ~200 cycles of latency
• Constant memory: 64 KB
• Texture memory (in global memory)
• Host/CPU and its RAM are connected to the device at 4-8 GB/s

Execution model:
• A kernel is launched over a grid (1D, 2D or 3D) of work groups.
• A work group of size Sx x Sy (get_local_size(0), get_local_size(1)) contains work items; groups are indexed by (get_group_id(0), get_group_id(1)) and work items within a group by (get_local_id(0), get_local_id(1)).
• Max #work items per work group: 1024
• Executed in warps/wavefronts of 32/64 work items
• Max work groups simultaneously on a multiprocessor: 8
• Max active warps on a multiprocessor: 24/48
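To connect the terminology to code, here is a minimal OpenCL C kernel (the kernel and argument names are illustrative) showing how a work item locates itself in the grid: the global index is simply the group index times the group size plus the local index.

// Illustrative kernel: each work item writes its own global x-index.
__kernel void identify(__global int* out, int width)
{
    // Global position, composed from group and local indices:
    int gx = get_group_id(0) * get_local_size(0) + get_local_id(0);
    int gy = get_group_id(1) * get_local_size(1) + get_local_id(1);
    // Equivalent shortcuts: get_global_id(0) == gx, get_global_id(1) == gy.

    out[gy * width + gx] = gx;
}

On NVIDIA hardware such a kernel executes in warps of 32 work items, on AMD in wavefronts of 64, which is why work-group sizes are usually chosen as multiples of these numbers.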
Semi-abstract scalable hardware model

• Trade-off: you need to know more hardware details than for a CPU, but the code remains compatible and efficient across devices.
• You need to know the model to write effective and efficient code.
• On a CPU, by contrast, the processor itself ensures efficient execution.
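One practical consequence of this semi-abstract model: the device limits listed on the GPU Programming Concepts slide are not hard-coded but queried at run time, so the same code can adapt to different GPUs. A minimal host-side sketch (error checking omitted for brevity):

#include <stdio.h>
#include <CL/cl.h>

/* Illustrative sketch: query the device limits that a kernel configuration
   must respect. */
int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    size_t maxWorkGroup;
    cl_ulong localMem, globalMem;
    cl_uint computeUnits;
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(maxWorkGroup), &maxWorkGroup, NULL);
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                    sizeof(localMem), &localMem, NULL);
    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                    sizeof(globalMem), &globalMem, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(computeUnits), &computeUnits, NULL);

    printf("max work-group size : %zu\n", maxWorkGroup);
    printf("local memory        : %llu bytes\n", (unsigned long long)localMem);
    printf("global memory       : %llu bytes\n", (unsigned long long)globalMem);
    printf("compute units (MPs) : %u\n", computeUnits);
    return 0;
}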
Increased code complexity
1. Complex index calculations
   • Mapping data elements onto processing elements (at least 2 levels)
   • Sometimes it is better to group elements (see the sketch below)
2. Optimizations
   • The impact on performance needs to be tested
3. A lot of parameters:
   a. Algorithm, implementation
   b. Configuration of the mapping
   c. Hardware parameters (limits)
   d. Optimized versions
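As an example of point 1 (grouping elements), the following illustrative kernel lets each work item process four consecutive pixels instead of one; whether that actually pays off has to be measured, which is exactly point 2.

// Illustrative kernel: each work item scales PIXELS_PER_ITEM consecutive
// pixels, reducing the total number of work items and the indexing overhead.
#define PIXELS_PER_ITEM 4

__kernel void scale_grouped(__global const float* in,
                            __global float* out,
                            int n, float factor)
{
    int base = get_global_id(0) * PIXELS_PER_ITEM;
    for (int i = 0; i < PIXELS_PER_ITEM; ++i) {
        int idx = base + i;
        if (idx < n)                 // guard the last, possibly partial, group
            out[idx] = factor * in[idx];
    }
}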
Methodology

Application
→ Identification of compute-intensive parts
→ Feasibility study of GPU acceleration
→ GPU implementation
→ GPU optimization
→ Hardware

Implementation routes: parallelization by compiler, pragma-based, skeleton-based, or OpenCL.
Obstacle 2: Hard(er) to get efficiency
• We expect peak performance
  - Speedup of 100x possible
• At least, we expect some speedup
  - But what is 5x worth?
• Reasons for low efficiency?
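Answering these questions starts with measuring what a kernel actually achieves. Below is a sketch of a small helper using OpenCL's event profiling; it assumes the command queue was created with CL_QUEUE_PROFILING_ENABLE and that the kernel arguments have already been set.

#include <CL/cl.h>

/* Returns the device-side execution time of one 2D kernel launch, in
   milliseconds. Illustrative helper; error checking omitted. */
double time_kernel_ms(cl_command_queue queue, cl_kernel kernel,
                      const size_t global[2], const size_t local[2])
{
    cl_event evt;
    clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local,
                           0, NULL, &evt);
    clWaitForEvents(1, &evt);

    cl_ulong start, end;                       /* timestamps in nanoseconds */
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);
    clReleaseEvent(evt);
    return (end - start) * 1e-6;
}

Dividing the kernel's flop count and byte traffic by this time gives the achieved GFLOPS and GB/s, which can then be compared against the peak numbers; this is where the roofline model on the next slide comes in.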
Roofline model
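The idea in one line: attainable performance is the minimum of the device's peak compute rate and its memory bandwidth multiplied by the kernel's arithmetic intensity (flops per byte moved). A small sketch using the ballpark figures quoted earlier in this talk (about 1 TFLOPS and 100 GB/s; these are examples, not measurements):

#include <stdio.h>

/* Roofline bound: a kernel is limited either by peak compute or by memory
   bandwidth times its arithmetic intensity, whichever is lower. */
int main(void)
{
    double peak_gflops   = 1000.0;   /* ~1 TFLOPS GPU                 */
    double bandwidth_gbs = 100.0;    /* ~100 GB/s global memory       */

    for (double ai = 0.25; ai <= 64.0; ai *= 2.0) {
        double attainable = peak_gflops < ai * bandwidth_gbs
                              ? peak_gflops : ai * bandwidth_gbs;
        printf("arithmetic intensity %6.2f flop/byte -> %7.1f GFLOPS\n",
               ai, attainable);
    }
    return 0;
}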
Methodology: our contribution

Application
→ Identification of compute-intensive parts: algorithm characterization (anti-parallel patterns)
→ Feasibility study of GPU acceleration: performance estimation (roofline model & benchmarks)
→ GPU implementation: performance analysis (analytical model)
→ GPU optimization: bottlenecks & trade-offs (benchmarks)
→ Hardware: hardware characterization

Implementation routes, as before: parallelization by compiler, pragma-based, skeleton-based, or OpenCL.
Conclusions

Changed into…
Competence Center for Personal Supercomputing

• We offer training (to overcome obstacle 1)
  - Acquire expertise
  - Take an independent, critical position
• We offer feasibility and performance studies (to overcome obstacle 2)

Symposium: Brussels, December 13th, 2012
http://parallel.vub.ac.be


Editor's Notes

  1. First, we have to understand where the tremendous computational power of the GPU comes from. The CPU is capable of running a(ny) sequential program very fast. The GPU has a lot of processing units, but programming them requires more care: map part of the computational work onto a processing element, describe it by a kernel, and have the kernel executed by a 'thread'. E.g. in image processing, a pixel is the work unit.
  2. Case of KLA Tencor (ICOS, Leuven): inspection machines needing real-time image processing.
  3. Re-implementation of algorithms is required…
  4. On the left is the abstract hardware model, on the right the execution model. Both should be understood in order to write OpenCL programs. This contrasts with the simple Von Neumann model used for CPUs.
  5. Our focus is on OpenCL programming and not on high-level solutions that generate GPU programs. Those solutions are, in my opinion, not mature yet.
  6. Is 5x worth the effort of porting to GPUs?
  7. The roofline model shows which resource bounds the overall performance.
  8. After each waterfall follows calm water, but you have to accept the turbulence first. And you don't know when you're out of trouble.
  9. After each waterfall follows calm water, but you have to accept the turbulence first. And you don't know when you're out of trouble.