Filipo Novo Mór
Advisors:
Dr. César Augusto Missio Marcon
Dr. Andrew Rau-Chaplin
GPU Performance Prediction
Using High-level Application
Models
ERAD 2014 presentation
March 2014
Pontifical Catholic University of Rio Grande do Sul
Faculty of Informatics
Postgraduate Programme in Computer Science
Outline
• Objectives
• Related Works
• Graphic Processor Units
• Methodology
• Performance Prediction Engine
• Work Schedule
Objectives
• To model applications at a high level in order to
predict their behaviour when running on a GPU.
– Secondary goals:
• To create a description of a high-level model for the target
GPU architecture.
• To evaluate the impact of using different cache sizes on
the tested applications
3 / 17
Related Works
• Theoretical works:
work | authors | modelling (app. / arch.) | inputs (CUDA / HLRA) | outputs
An Adaptive Performance Modeling Tool for GPU Architectures | Baghsorkhi et al. | no / yes | source code / no | performance prediction and bottleneck indicators
Cross-architecture Performance Predictions for Scientific Applications Using Parameterized Models | Marin and Mellor-Crummey | yes / yes | source code / no | performance prediction
An Analytical Model for a GPU Architecture with Memory-level and Thread-level Parallelism Awareness | Hong and Kim | no / no | source code / no | performance prediction; also proposed two new metrics for GPU modelling, MWP and CWP
Exploring the multiple-GPU design space | Schaa and Kaeli | no / yes | source code / no | performance benchmark
A Quantitative Performance Analysis Model for GPU Architectures | Zhang and Owens | no / yes | source code / no | performance benchmark
this work | — | yes / yes | no / yes | performance prediction
4 / 17
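Of the works above, Hong and Kim's MWP (memory warp parallelism) and CWP (computation warp parallelism) metrics capture, roughly, how much memory latency the warp scheduler can hide behind computation. The sketch below is a loose illustration of that idea only; the simplified formulas and all parameter values are assumptions, not the exact model from their paper:

```python
def mwp(mem_latency, departure_delay, active_warps):
    """Memory Warp Parallelism (simplified): how many warps can have
    memory requests in flight at once, capped by the active warps."""
    return min(mem_latency / departure_delay, active_warps)

def cwp(mem_cycles, comp_cycles, active_warps):
    """Computation Warp Parallelism (simplified): how many warps' worth of
    computation fits under one memory access period."""
    return min((mem_cycles + comp_cycles) / comp_cycles, active_warps)

# Made-up cycle counts for illustration:
print(mwp(400, 40, 8))   # 8
print(cwp(400, 100, 8))  # 5.0
```

Intuitively, when MWP exceeds CWP there is enough memory-level parallelism to overlap the latency; otherwise warps stall waiting on memory.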
Related Works
• Application tools:
work | authors | inputs | outputs | target architecture
Barra | Collange et al. | CUDA source code | execution measurements | NVIDIA Tesla
GPU_Sim | Bakhoda et al. | CUDA source code | execution measurements | NVIDIA Tesla and GT200
GPU Ocelot | Diamos et al. | CUDA source code | execution measurements | PTX 2.3 (CUDA 4.0)
this work | — | HLRA | execution measurements | NVIDIA GK110

Source: gpgpu-sim.org
5 / 17
Graphic Processor Unit
Simplified architecture of a NVIDIA GPU
6 / 17
Graphic Processor Unit
Simplified architecture of a NVIDIA GPU showing the
internal structure of streaming multiprocessors
7 / 17
Graphic Processor Unit
When a thread block is assigned to a streaming
multiprocessor, it is divided into units called warps.
8 / 17
Figure: Mohamed Zahran
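On NVIDIA GPUs a warp is 32 threads, so the number of warps a block produces is the ceiling of its thread count over 32. A minimal sketch:

```python
import math

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs

def warps_per_block(threads_per_block):
    """A block of N threads is split into ceil(N / 32) warps;
    a partial last warp still occupies a full warp slot."""
    return math.ceil(threads_per_block / WARP_SIZE)

print(warps_per_block(256))  # 8
print(warps_per_block(100))  # 4 (3 full warps plus 1 partial warp)
```

The partial-warp case matters for a performance model: a block of 100 threads costs scheduler slots for 4 warps, not 3.125.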
Graphic Processor Unit
SIMT vs SIMD
• Single Instruction, Multiple Register Sets: each thread has its own register
set, consequently, instructions may process different data simultaneously on
different parallel running threads.
• Single Instruction, Multiple Addresses: each thread is permitted to freely
access non-coalesced memory addresses, giving more flexibility to the
programmer. However, this is an unsafe technique because parallel accesses to
non-coalesced addresses may be serialized into multiple transactions, which
reduces performance significantly.
• Single Instruction, Multiple Flow Paths: the control flow of different parallel
running threads can diverge.
9 / 17
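The coalescing point above can be made concrete with a toy transaction counter. This is a deliberate simplification (one transaction per distinct 128-byte segment touched by a warp; the 128-byte granularity is an assumption, and real hardware rules are more involved):

```python
WARP_SIZE = 32
SEGMENT_BYTES = 128  # assumed memory transaction granularity

def transactions(addresses):
    """Simplified coalescing model: one transaction per distinct
    128-byte segment touched by the warp's addresses."""
    return len({addr // SEGMENT_BYTES for addr in addresses})

coalesced = [4 * i for i in range(WARP_SIZE)]     # consecutive 4-byte words
scattered = [4096 * i for i in range(WARP_SIZE)]  # one word every 4 KiB

print(transactions(coalesced))  # 1
print(transactions(scattered))  # 32
```

The same 32 loads cost one memory transaction when consecutive but up to 32 when scattered, which is the serialization penalty the slide warns about.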
Graphic Processor Unit
Branch Divergence
10 / 17
Graphic Processor Unit
Branch Divergence
11 / 17
Graphic Processor Unit
The Key Challenges for GPU Programming
• Data transfer between CPU and GPU
• Memory access
• Branch divergence
• No recursion
12 / 17
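The first challenge in the list, CPU-GPU data transfer, is often modelled to first order as a fixed latency plus bytes over sustained bandwidth. The sketch below uses assumed PCIe-like figures, not measured values:

```python
def transfer_time(n_bytes, latency_s=10e-6, bandwidth_Bps=6e9):
    """First-order host<->device copy model: fixed per-transfer latency
    plus payload size over sustained bandwidth (assumed figures)."""
    return latency_s + n_bytes / bandwidth_Bps

small = transfer_time(4 * 1024)           # latency-dominated
large = transfer_time(256 * 1024 * 1024)  # bandwidth-dominated
print(f"{small * 1e6:.1f} us, {large * 1e3:.1f} ms")  # 10.7 us, 44.7 ms
```

The two regimes explain a common GPU programming guideline: batch many small transfers into one large one so the fixed latency is paid once.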
Methodology
13 / 17
Methodology
Validating
• Applications will be implemented in CUDA as well as in
HLRA.
• Applications will be chosen according to their profiles:
– Computation vs Communication
– Sizing
14 / 17
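Validation of the dual CUDA/HLRA implementations will presumably compare predicted against measured execution times; a relative-error metric is the usual choice. The numbers below are hypothetical, for illustration only:

```python
def relative_error(predicted_s, measured_s):
    """Prediction accuracy as relative error against the CUDA measurement."""
    return abs(predicted_s - measured_s) / measured_s

# Hypothetical prediction (0.92 s) vs measurement (1.00 s):
print(f"{relative_error(0.92, 1.00):.0%}")  # 8%
```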
Performance Prediction Engine
Aspects to be considered by the engine
• Branch divergence
• Memory access
– Local, global, and shared memory, and per-thread registers.
• Thread synchronization
• Loops
15 / 17
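One way an engine covering the aspects above might combine per-kernel estimates is sketched below. Every parameter name, weight, and the linear combination itself are assumptions for illustration; the actual engine design is the subject of the thesis, not this toy:

```python
def predict_kernel_time(n_warps, comp_cycles, mem_transactions,
                        divergence_factor=1.0, sync_cycles=0,
                        loop_trips=1, mem_latency=400, clock_hz=1e9):
    """Toy cost model over the aspects listed above: compute work scaled
    by branch divergence, plus memory transactions, synchronization,
    all repeated across loop iterations."""
    per_iter = (comp_cycles * divergence_factor
                + mem_transactions * mem_latency
                + sync_cycles)
    total_cycles = n_warps * loop_trips * per_iter
    return total_cycles / clock_hz

t = predict_kernel_time(n_warps=64, comp_cycles=200, mem_transactions=4,
                        divergence_factor=1.5, sync_cycles=50, loop_trips=10)
print(f"{t * 1e3:.2f} ms")  # 1.25 ms
```

Even a crude model like this exposes the levers the slides list: divergence inflates compute cycles, memory transactions dominate when latency is high, and loops multiply everything.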
Work Schedule
16 / 17
Questions
Filipo Novo Mór
filipo.mor at acad.pucrs.br
