Thesis Defense

This is my thesis defense presentation. The overall subject is code generation targeting OpenCL on GPGPU architectures.

Published in: Technology
  • Meeting Notes (19/01/12 11:16): slides are good; conclusion: Valeo/region, they paid!; within the team; swap the logos
  • Thesis Defense

    1. A Methodology to Develop High Performance Applications on GPGPU Architectures: Application to Simulation of Electrical Machines
       THESIS DEFENSE
       Antonio Wendell DE OLIVEIRA RODRIGUES
       Advisor: Jean-Luc DEKEYSER
       Co-Advisor: Frédéric GUYOMARCH
    2. Context
       Figure labels: Numerical Methods, Physics, Software, Architecture
       April 3, 2012, Wendell Rodrigues, MDE Methodology for GPGPU: Thesis Defense (slide 2 of 38)
    3. Context: Problem
       • How to help non-specialists in programming/architecture develop their applications
       • How to automatically generate code that is efficient enough compared with manually written code
          • Taking advantage of available resources
       • How to integrate profiling and development tools
    4. Context: GPGPU
       • General-Purpose Computation on GPUs
       • Massively Parallel Processing
       • Green Computing
          • Tsubame 2.0: 958 MFlops/watt
          • Cielo Cray: 278 MFlops/watt
       Figure labels: Physics, Software, Architecture, GPGPU
    5. Context: GPGPU (Programming)
       • CUDA
          • Nvidia’s solution
          • First real high-level GPGPU programming
          • Large number of applications, libraries, developers
          • Achieves better performance on Nvidia’s hardware
       • OpenCL (Open Computing Language)
          • Open standard proposed by the Khronos Group™
          • Multi-vendor (including Nvidia)
          • Not only for GPUs
    6. Context: Related Work
       • Code-Level Specification
          • Directives: PGI, OpenHMPP, Annotated C
          • Interfaces, translation: Java, Python, Matlab
          • Specific language: SAC (Single Assignment C), WITH-loop expressions (CUDA backend)
       • High-Level Specification
          • Simulink, OpenModelica, Mindstorms (LabVIEW)
          • Gaspard: OpenMP branch (Julien Taillard)
             • Programming model, specification
    7. Context: High Level Specification
       • How to separate the algorithm and the hardware specifications
       • MDE
          • specify an application (UML/MARTE)
          • the expression of its potential parallelism
          • the platform architecture
          • the link between logical and physical parts
       • Model Driven Engineering
          • Clear separation between hardware and software specifications
          • UML: diagrams, tools
          • UML profile for MARTE: parallel expressiveness inspired by ArrayOL; enables factorization of repeated elements
       Figure labels: Physics, Software, Architecture, GPGPU
    8. Contribs: Methodology
       Points of View (according to MARTE)
       Use-case diagram labels: MARTE Specification; Define Methodology; Build Model; Adapt MARTE Specification; Annotate Model for Analysis; Build Execution Platform Model; Analyze Model; Provide Execution Platform (OpenCL, GPU cards, drivers, etc.); <<include>> relations between use cases
    9. Contribs: Methodology
       Code Generation from Higher Level Models (Gaspard2)
       • Compilation of models
       • Targets shown: OpenCL, OpenMP, Pthread, VHDL
       • Result: a program (source code, makefiles, etc.)
    10. Contribs: Building a Model
       This is the model designer’s point of view
       Global view of a whole model (diagram labels): Application, Architecture, Allocation, Deployment, Virtual IP, Software IP, Artifacts
    11. Contribs: Building an Application
       Application: Code_CARMEL (L2EP lab/EDF)
       • Electrical machines modeling and simulation
       • Sparse matrices TCarmel/FCarmel in CSR format
       Diagram labels: Matrix Assembly, PostProcess, GenPARAM, GenPHYS, GenDOF, Input, Solver, Output
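Slide 11 states that the TCarmel/FCarmel matrices are stored in CSR format. A minimal sketch of that layout, assuming the usual three-array convention (nonzero values, row pointers, column indices). The function name `dense_to_csr` and the sample matrix are illustrative only and are not taken from Code_CARMEL.

```python
def dense_to_csr(dense):
    """Convert a dense matrix (list of rows) into CSR arrays.

    Returns (values, row_ptr, col_idx); names are hypothetical.
    """
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)    # nonzero value
                col_idx.append(j)   # its column
        row_ptr.append(len(values))  # where the next row starts
    return values, row_ptr, col_idx


# Small illustrative matrix (not from the thesis).
dense = [
    [4.0, 0.0, 1.0],
    [0.0, 3.0, 0.0],
    [1.0, 0.0, 2.0],
]
values, row_ptr, col_idx = dense_to_csr(dense)
```

For a matrix with n rows and nnz nonzeros, the row-pointer array has n + 1 entries and the other two arrays have nnz entries each. This is consistent with the port sizes on the CG model slide (iA: Integer{132652} for 132651 unknowns; A and jA of size 3442951).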
    12. Contribs: CG Example
       Solver: Conjugate Gradient
       • Numerical method to solve a system of linear equations
    13. Contribs: CG Example
       UML/MARTE modeling tool (Eclipse/Papyrus)
       Model views: Application, Architecture, Allocation, Deployment
       dotProd component (diagram elements):
       • Ports: A: Real {1000}, B: Real {1000}, C: Real {1}
       • Parts: m: Mult {1000} (ports el1: Real {1}, el2: Real {1}), r: Reduc {1}
       • Intermediate array: res: Real {1000}
       • Connectors: <<tiler>> (two), <<reshape>>
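The dotProd component on slide 13 combines a repeated elementary task (m: Mult, repeated once per element) with a reduction (r: Reduc). A minimal sequential sketch of that decomposition, assuming Mult is a scalar product and Reduc is a sum. The Python function names are hypothetical, and the list comprehension merely stands in for the parallel repetition.

```python
def mult(el1, el2):
    """Elementary task: one repetition of Mult on scalar inputs."""
    return el1 * el2


def reduc(res):
    """Reduction task: collapse the intermediate array to one value."""
    total = 0.0
    for v in res:
        total += v
    return total


def dot_prod(a, b):
    """dotProd = repeated Mult over the arrays, then Reduc."""
    if len(a) != len(b):
        raise ValueError("input arrays must have the same length")
    # One independent Mult per index: this is the data-parallel part.
    res = [mult(x, y) for x, y in zip(a, b)]
    return reduc(res)


c = dot_prod([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

Because each Mult repetition is independent, it can be mapped to one work-item; only the reduction needs cooperation between work-items.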
    14. Contribs: CG Example
       The whole algorithm: CG_Module_GPU (diagram elements)
       Model views: Application, Architecture, Allocation, Deployment
       • Top-level parts: init: InitVars, norm_r0: norm, cg: CGLoop, x_out
       • CGLoop parts: rr: dotProd, alpha: ScalarDiv, x: DAXPY, beta: ScalarDiv, p: DAXPY, error: ScalarDivSqrt, Ap: dgemvCSR, pAp: dotProd, minusalpha: Negative, r: DAXPY, rrnew: dotProd
       • Vectors of size Real{132651}: b, x0, r_0, r_k, r_k1, p_k, p_k1, x_k, x_k1
       • Scalars of size Real{1}: norm_r0, error
       • Sparse matrix ports: A: Real{3442951}, iA: Integer{132652}, jA: Integer{3442951} (the extracted text also shows iA and jA once each as Real{132651})
       • Stereotypes: <<interrepetition>>, <<defaultlink>>, <<shaped>>
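The task names in the CGLoop model (dgemvCSR, dotProd, DAXPY, ScalarDiv, ScalarDivSqrt) line up with the steps of the textbook conjugate gradient method. The sketch below is a plain sequential version written under that assumption; it is not the generated OpenCL code. In particular, the relative-residual formula used for `error` is a guess at what ScalarDivSqrt computes, and the tolerance and test matrix are illustrative.

```python
import math


def dgemv_csr(values, row_ptr, col_idx, x):
    """Sparse matrix-vector product y = A x with A in CSR format."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += values[k] * x[col_idx[k]]
        y[i] = s
    return y


def dot_prod(a, b):
    return sum(x * y for x, y in zip(a, b))


def daxpy(alpha, x, y):
    """Return alpha * x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def conjugate_gradient(values, row_ptr, col_idx, b, x0, tol=1e-10, max_iter=1000):
    """Textbook CG for a symmetric positive definite matrix in CSR form."""
    x = list(x0)
    r = daxpy(-1.0, dgemv_csr(values, row_ptr, col_idx, x), b)  # r0 = b - A x0
    p = list(r)
    rr = dot_prod(r, r)
    norm_r0 = math.sqrt(rr)
    if norm_r0 == 0.0:
        return x, 0
    for k in range(1, max_iter + 1):
        ap = dgemv_csr(values, row_ptr, col_idx, p)  # Ap: dgemvCSR
        alpha = rr / dot_prod(p, ap)                 # pAp: dotProd, alpha: ScalarDiv
        x = daxpy(alpha, p, x)                       # x: DAXPY
        r = daxpy(-alpha, ap, r)                     # minusalpha: Negative, r: DAXPY
        rr_new = dot_prod(r, r)                      # rrnew: dotProd
        error = math.sqrt(rr_new) / norm_r0          # assumed meaning of ScalarDivSqrt
        if error < tol:
            return x, k
        beta = rr_new / rr                           # beta: ScalarDiv
        p = daxpy(beta, p, r)                        # p: DAXPY
        rr = rr_new
    return x, max_iter


# Illustrative 3x3 SPD system in CSR form (not from the thesis).
values = [4.0, 1.0, 3.0, 1.0, 2.0]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
b = [1.0, 2.0, 3.0]
solution, iterations = conjugate_gradient(values, row_ptr, col_idx, b, [0.0, 0.0, 0.0])
```

Each iteration consists of one sparse matrix-vector product, two dot products and three DAXPY updates, which is why those elementary tasks appear repeatedly in the model.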
    15. Contribs: CG Example
       Defining the Architecture
       Model views: Application, Architecture, Allocation, Deployment
    16. Contribs: CG Example
       Allocating Tasks
       Model views: Application, Architecture, Allocation, Deployment
    17. Contribs: CG Example
       Allocating FlowPorts
       Model views: Application, Architecture, Allocation, Deployment
    18. Contribs: CG Example
       Elementary Tasks and Software IP
       Model views: Application, Architecture, Allocation, Deployment
       Sip_Mult:
       • Code of an elementary function
       • Parameter order
       • Possible header files or libraries, compiling directives, and so on
       Diagram elements: <<manifest>>, <<artifact>>, <<codeFile>> MultCF
    19. Contribs: Execution Test and Results
       • The model designer starts the code generation
       • The model compiler generates a program: Makefile and source files
       Diagram (CGLoop allocated onto the architecture):
       • CGLoop parts: rr: dotProd, alpha: ScalarDiv, x: DAXPY, beta: ScalarDiv, p: DAXPY, error: ScalarDivSqrt, Ap: dgemvCSR, pAp: dotProd, minusalpha: Negative, r: DAXPY, rrnew: dotProd
       • Data: r_k, r_k1, x_k, x_k1, p_k, p_k1: Real{132651}; norm_r0, error: Real{1}; A: Real{3442951}; iA: Integer{132652}; jA: Integer{3442951}
       • Architecture: h1: HOST {1}, d1: DEVICE {1}, mp: Memory, mgp: Memory
       • Stereotypes: <<allocate>>, <<abstract>>, <<shaped>>
    20. Contribs: Execution Test and Results
       CG Program to CG Module for Code_CARMEL: Adaptation
       • GenDOF: Fortran
       • GenPHYS: Fortran
       • GenPARAM: Fortran
       • T/FCarmel: Fortran
       • PostProcessing: Fortran
       Other diagram labels: C/C++, C interface
    21. Contribs: Execution Test and Results
       Evaluating Scalability: FEM on different meshes
    22. Contribs: Execution Test and Results
       Results (charts): Execution Time, Performance
       • CPU: AMD Opteron, 8-core @ 2.4 GHz, 64 GB RAM
       • GPU: NVidia S1070, 4 Tesla T10 devices (4 GB RAM each), Compute Capability 1.3
    23. Contribs: How It Works
       This is the methodology provider’s point of view (the UML/MARTE-to-OpenCL chain)
       Figure residue: elements numbered 1 to 9 and a generated-code illustration (#include b.h; func(a,b){ c=a+b; })
    24. Contribs: UML/MARTE to OpenCL
       • UML-to-MARTE Transformation
          • avoids the UML complexity
          • keeps only the essential elements of MARTE
       • Port Instance Transformation
          • UML does not create FlowPort instances when a part (task) is instantiated
       Diagram elements: Mult; m: Mult {100}; k: Mult {20}; el1: Real {1}; el2: Real {1}; res: Real {1}
    25. Contribs: UML/MARTE to OpenCL
       Tiler-to-Task Transformation
       • Tilers are expressed in ArrayOL as stereotypes of connectors
       • They become special tasks allocated to the available processors
    26. Contribs: UML/MARTE to OpenCL
       • Local and Global Graphs Transformations
       • Scheduling Policy Transformation
       • Global graph: contains other sub-graphs
       Diagram elements: globalDependencies; p1_Task; dev; StartTask (Start Task); IPTask vec1; IPTask vec2; IPTask v1v2; EndTask (End Task)
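A task graph like the one on slide 26 can be turned into an execution order with a topological sort. The sketch below is only an illustration of that general idea: the slide does not describe how Gaspard2 actually derives its schedule, and the dependency edges (Start before vec1 and vec2, both before v1v2, then End) are an assumed reading of the flattened figure.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (assumed edges).
dependencies = {
    "vec1": {"Start"},
    "vec2": {"Start"},
    "v1v2": {"vec1", "vec2"},
    "End": {"v1v2"},
}

# static_order() yields every task after all of its predecessors.
schedule = list(TopologicalSorter(dependencies).static_order())
```

Tasks with no mutual dependency, such as vec1 and vec2 here, may appear in either order; a scheduler is free to run them concurrently.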
    27. Contribs: UML/MARTE to OpenCL
       Memory Mapping Transformation
       Transformation rules shown (numbered 1 to 5 in the figure): main, addMemoryMap, defineScope, propagateDataAllocation, createTilerTaskDA, defineBasicDataAllocations, createAffectationDataAllocation, createVirtualIPSoftIPDA
    28. Contribs: UML/MARTE to OpenCL
       Hybrid Transformation
       • Example model: main with HorizontalFilter (rhf: RHF {288,44}) and VerticalFilter (rvf: RVF {32,132}), both <<shaped>>, same allocation, connected through «tiler» connectors
       • Mapping notes: Thread (work-item), Grid definition
       • Transformation rules shown: createHybridApp, toHybridApp, toDevSide, toHostSide, toKernel, toMainFunction, toIPFunction, toTilerFunctions, kernelVars, mainVars, defineVars, optimizeTransfer
       • Other labels: Schedule Host, Schedule Device, refersTo
    29. Contribs: UML/MARTE to OpenCL
       Code Generation: Model-to-Text Transformation
       • Based on Acceleo templates
       • Functionalities
          • IP insertions
          • Tiler notation to memory address computation in C
          • Implements the memory transfer optimization
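The "tiler notation to memory address computation" item can be illustrated with the standard ArrayOL tiler definition: a tile's reference element is origin + paving · r (modulo the array shape) for repetition index r, and each pattern element i sits at reference + fitting · i (modulo the array shape). The sketch below applies that definition and linearizes the result in row-major order. It is written in Python for brevity, whereas the generated code is C; the function name, matrices and shapes are illustrative assumptions, not output of the Acceleo templates.

```python
from itertools import product


def tile_addresses(array_shape, origin, paving, fitting, rep_index, pattern_shape):
    """Row-major linear addresses of the tile selected by rep_index.

    paving[d][k] and fitting[d][k] are matrices with one row per array
    dimension d (hypothetical layout chosen for this sketch).
    """
    dims = range(len(array_shape))
    # Reference element of the tile: origin + paving . r (mod shape).
    ref = [
        (origin[d] + sum(paving[d][k] * rep_index[k] for k in range(len(rep_index))))
        % array_shape[d]
        for d in dims
    ]
    addresses = []
    for i in product(*(range(n) for n in pattern_shape)):
        # Pattern element: ref + fitting . i (mod shape).
        coord = [
            (ref[d] + sum(fitting[d][k] * i[k] for k in range(len(i)))) % array_shape[d]
            for d in dims
        ]
        linear = 0
        for d in dims:
            linear = linear * array_shape[d] + coord[d]  # row-major offset
        addresses.append(linear)
    return addresses


# A 4x8 array cut into tiles of 4 consecutive elements along each row.
array_shape = (4, 8)
origin = (0, 0)
paving = [[1, 0], [0, 4]]   # repetition (r0, r1) -> row r0, column 4*r1
fitting = [[0], [1]]        # pattern index i -> column offset i
tile = tile_addresses(array_shape, origin, paving, fitting,
                      rep_index=(1, 1), pattern_shape=(4,))
```

Here repetition (1, 1) selects row 1, columns 4 to 7, which are linear addresses 12 to 15 in a row-major 4x8 array.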
    30. Contribs: UML/MARTE to OpenCL
       Code Generation: Model-to-Text Transformation (Multiple Devices)
       Model elements: p: DAXPY {100} <<shaped>>; d1: Device {4} <<hwResource>> <<shaped>>; gp: GPU; mgp: Memory; <<abstraction>> <<allocate>>
       Single device:
          send(dataaddress) with size data to Device;
          Launch Kernel on Device with grid (WG,WI)
          recv(dataaddress) with size data from Device;
       Multiple devices:
          for (i = 0; i < numDev; i++)
             send(dataaddress + i*data/numDev) with size data/numDev to Device i;
          for (i = 0; i < numDev; i++)
             Launch Kernel on Device i with grid (WG/numDev,WI)
          for (i = 0; i < numDev; i++)
             recv(dataaddress + i*data/numDev) with size data/numDev from Device i;
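The multi-device pseudocode on slide 30 splits the data evenly: device i handles the chunk starting at offset i*data/numDev with size data/numDev. A minimal host-side simulation of that partitioning, with plain Python lists standing in for device buffers. The helper names are hypothetical, no OpenCL call is made, and the sketch assumes the data size is divisible by the number of devices, as the pseudocode implicitly does.

```python
def partition(data_size, num_dev):
    """Return one (offset, size) pair per device."""
    if data_size % num_dev != 0:
        # Assumption of this sketch; the slide does not cover remainders.
        raise ValueError("data_size must be divisible by num_dev")
    chunk = data_size // num_dev
    return [(i * chunk, chunk) for i in range(num_dev)]


def run_on_devices(kernel, data, num_dev):
    """Simulate send / launch / recv over num_dev devices."""
    parts = partition(len(data), num_dev)
    # send: each device receives its own slice of the input.
    buffers = [data[off:off + size] for off, size in parts]
    # launch: each device applies the kernel to its slice.
    results = [[kernel(v) for v in buf] for buf in buffers]
    # recv: slices are copied back to their original offsets.
    out = [None] * len(data)
    for (off, size), res in zip(parts, results):
        out[off:off + size] = res
    return out


chunks = partition(100, 4)  # p: DAXPY {100} on d1: Device {4}
doubled = run_on_devices(lambda v: 2 * v, list(range(8)), 4)
```

With the sizes from the slide (100 repetitions, 4 devices), each device processes 25 elements.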
    31. Contribs: UML/MARTE to OpenCL
       Code Generation: Model-to-Text Transformation
       • Tiler Analysis (Shared Memory Use)
    32. Contribs: Profiling Analysis
       Integrating Profiler and Models
       Diagram labels (steps are numbered 1 to 7 in the figure): High Level Abstraction; Model of Application, Architecture and Allocation; Model Transformation Chain; Profiling and Optimization Hints; Profiling and Advice Annotations; Vincent Aranega’s Thesis (2011); Model Production; Domain Specific Profiling Analysis; Transformation Library; Generated Code Files (Makefile, *.cl, *.cpp, *.h); Trace Models; Profiling Log Model; Device Features Database; Model SDK; Compilation Process; UID base link; Log Parser; Binaries and Runtime Files; Execution; Profiling Logs; Software Production; Hybrid Execution Platform
    33. Contribs: Profiling Analysis
       Integrating Profiler and Models (Case Study)
       Figure label: {16,1000000}
    34. Contribs: Profiling Analysis
       Integrating Profiler and Models (Case Study)
       Figure label: ~ 60%
    35. Experimental Validation: Alternator from Valeo
       Generated code for PCG in Code_CARMEL for an industrial application
    36. Experimental Validation: Alternator from Valeo
       Sparse matrix
       • N = 775,689
       • NNZ = 12,502,443
       Solution: Preconditioned Conjugate Gradient (PCG) in 10,000 iterations

                              Time (s)          Speedup
          CPU (AMD Opteron)   2300 (~38 min)    1
          GPU (S1070)         250 (~4 min)      9.2
    37. Conclusions and Perspectives
       • Developing Methodology
          • Non-specialists can develop their applications from higher-level specifications
       • Optimizations and MultiGPU
          • Memory issues: efficient code
          • Profiling integration
          • Scaling according to hardware
       • Numerical Methods (Industrial Applications)
          • Speedups > 9x
          • Multiple simulations: 10 hours/simulation reduced to about 1 hour
          • High performance with low investment in hardware
       • Code_CARMEL Integration
    38. Conclusions and Perspectives
       • GPU Clusters
          • For instance, Tianhe in China
          • MPI as a solution for inter-node communication
          • Issues: distributed memory, communication, synchronization
       • High-Level Control on the Code Generation Chain
          • Optimization levels, dynamic parameters
