Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

Unai Lopez Novoa
19 June 2015
PhD Dissertation
Advisors: Jose Miguel-Alonso & Alexander Mendiburu
Contributions to the Efficient Use of
General Purpose Coprocessors:
Kernel Density Estimation as Case Study

Outline
• Introduction
• Contributions
1) A Survey of Performance Modeling and Simulation Techniques
2) S-KDE: An Efficient Algorithm for Kernel Density Estimation
• And its implementation for Multi and Many-cores
3) Implementation of S-KDE in General Purpose Coprocessors
4) A Methodology for Environmental Model Evaluation based on S-KDE
• Conclusions

High Performance Computing
• Branch of computer science related to the use of parallel
architectures to solve complex computational problems
• Today’s fastest supercomputer: Tianhe-2
(China’s National University of Defense Technology, 33.86 PFLOP/s)

HPC Environments
• Traditional HPC systems were homogeneous, built
around single or multi-core CPUs
• But supercomputers are becoming heterogeneous
(Coprocessor number evolution in the Top500 list over time)

Compute platforms
(Device: features; peak double-precision performance)
• Multi-Core CPUs: branch prediction, out-of-order execution (OoOE), “versatile”; up to 250 GFLOP/s
• Graphics Processing Units: hundreds of cores, handle thousands of threads; up to 1.8 TFLOP/s
• Many-Core Processors: tens of x86 cores, HyperThreading; up to 1 TFLOP/s

Motivation
• Examples of successful porting of applications to
accelerators (compared against multi-core implementations):
• SAXPY: 11.8x
• Polynomial Equation Solver: 79x
• Image Treatment (MRI): 263x
• …
• … but this is not applicable to every HPC code
Ryoo, Shane, et al. "Optimization principles and application performance evaluation of a
multithreaded GPU using CUDA." Proceedings of the 13th ACM SIGPLAN Symposium on Principles
and practice of parallel programming. ACM, 2008. (>700 cites on Google Scholar)

Difficulties using accelerators
• Suitable codes for accelerators should:
• Expose high levels of parallelism
• Have a good spatial/temporal data locality
• …
• Porting a code requires extensive program rewriting
• Development tools for accelerators are not as polished
as those for CPUs
Effectively exploiting the performance of a
coprocessor remains a challenging task

Structure of this thesis
• Motivation: discuss the issues in using general purpose coprocessors efficiently
• A survey of performance modeling and simulation techniques
• Case study: Kernel Density Estimation applied to environmental model evaluation
• Design of a novel algorithm for Kernel Density Estimation: S-KDE
• S-KDE for Multi & Many-Cores
• S-KDE for Accelerators
• A methodology for environmental model evaluation based on S-KDE

A Survey of Performance Modeling
and Simulation Techniques

Developing for accelerators
• Approaches/aids:
• Trial and error
• Profilers / Debuggers / …
• Performance Models

A survey of models and simulators
• Accelerator & GPGPU trend began ~2005
• First performance models appeared ~2007
• Abundant literature
• No outstanding models or tools

Taxonomy
• Execution time estimation
• Bottleneck highlighting
• Power consumption estimation
• Simulators

Model analysis
• We analysed 29 relevant accelerator models
• For each of them we summarized and identified:
• Modeling method (Analytical, Machine Learning,…)
• Target platforms and test devices
• Input preprocessing requirements
• Limitations
• Highlights over other models

The MWP-CWP model
• Presented by Hong & Kim in 2009 (>360 cites in Google Scholar)
• Estimates the execution time of a GPU application
• Based on how Warps are scheduled in NVIDIA GPUs
Method: Analytical
Test platform: NVIDIA GPUs (8800GT, …)
Input requirements: Run µbenchmarks & parse PTX
Limitations: Branches not modeled
Highlights: Extendable to non-NVIDIA GPUs

The Roofline model
• Presented by Williams et al. in 2009 (>450 cites in Google Scholar)
• Outstanding model for bottleneck highlighting
• Visual model:
Method: Analytical
Test platform: Multi-core CPUs & Accelerators
Input requirements: Run µbenchmarks & analyse application
Limitations: Depends on architecture
Highlights: Visual output to guide optimizations
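
The central relation of the Roofline model is that attainable performance is the minimum of the peak compute rate and the memory bandwidth multiplied by the arithmetic intensity (FLOP per byte). A minimal Python sketch of that relation follows; the function name and the device figures are illustrative assumptions, not values from the slides.

```python
def roofline_gflops(peak_gflops, bandwidth_gbs, arithmetic_intensity):
    """Attainable GFLOP/s for a kernel with the given FLOP/byte ratio."""
    return min(peak_gflops, bandwidth_gbs * arithmetic_intensity)

# Hypothetical device: 1000 GFLOP/s peak, 100 GB/s memory bandwidth.
ridge_point = 1000.0 / 100.0                 # FLOP/byte where the two roofs meet
low = roofline_gflops(1000.0, 100.0, 0.25)   # left of the ridge: memory-bound
high = roofline_gflops(1000.0, 100.0, 50.0)  # right of the ridge: compute-bound
```

A kernel to the left of the ridge point benefits from memory optimizations, while one to the right benefits from compute optimizations.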

Performance tools
• Some models require running performance tools
(µbenchmarks, profilers,…)
• We have reviewed them as well

Conclusions
1) There is no accurate model valid for a wide set of
architectures
2) Most models are tied to CUDA
3) There is a growing interest in analyzing power
4) It was impossible to make a comparison of the models
(lack of details, codes, …)

S-KDE: An Efficient Algorithm for
Kernel Density Estimation
(and its implementation for Multi and Many-cores)

Case study
• Collaborative Work:
EOLO
UPV/EHU Climate and Meteorology Group
• Scenario:
Environmental Model Evaluation
• Problem:
Excessive execution times of KDE

Kernel Density Estimation
• Statistical technique used to estimate the Probability
Density Function (PDF) of a random variable with
unknown characteristics
• where:
• x_i are the samples of the random variable
• K is the Kernel function
• H is the bandwidth value
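
For reference, a standard multivariate form of the estimator that uses the symbols listed above is shown below. The number of samples n and the scaled kernel K_H are notation assumed here, and the exact form on the original slide may differ.

```latex
\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} K_H(x - x_i),
\qquad
K_H(u) = |H|^{-1/2} \, K\!\left(H^{-1/2} u\right)
```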

Kernel function
• Symmetric function that integrates to one
• We classify them according to their area of influence
(Figure: density vs. x for two kernels. Gaussian: unbounded. Epanechnikov: bounded)
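
The two kernels can be written down in a few lines. The sketch below uses the usual univariate definitions (the function names and the numerical integration helper are assumptions for illustration) and shows the difference in area of influence: the Epanechnikov kernel is exactly zero outside [-1, 1], while the Gaussian never reaches zero.

```python
import math

def gaussian_kernel(u):
    """Unbounded support: positive for every u."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def epanechnikov_kernel(u):
    """Bounded support: zero outside [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def integrate(kernel, lo, hi, steps=100000):
    """Midpoint-rule integral, used only to check that a kernel integrates to one."""
    width = (hi - lo) / steps
    return sum(kernel(lo + (i + 0.5) * width) for i in range(steps)) * width
```

The bounded support is what makes it possible to skip evaluation points that lie far from a sample.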

Bandwidth
• Parameter to control the smoothness of the estimation
• It must be carefully selected
• Common approaches for its selection
• Heuristic as in Silverman, 1986
• Iterative technique, e.g., bootstrapping
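
As an illustration of the heuristic approach, the sketch below implements one widely quoted univariate rule of thumb from Silverman (1986): h = 0.9 · min(σ, IQR / 1.34) · n^(-1/5). The slides do not say which variant the authors used, so this is an assumption, and the function name is hypothetical.

```python
import statistics

def silverman_bandwidth(samples):
    """Univariate rule-of-thumb bandwidth for a list of numbers."""
    n = len(samples)
    sigma = statistics.stdev(samples)
    q1, _, q3 = statistics.quantiles(samples, n=4)  # quartile cut points
    spread = min(sigma, (q3 - q1) / 1.34)
    return 0.9 * spread * n ** (-0.2)
```

The rule is a single pass over the data, whereas iterative techniques re-run the density estimation for each candidate bandwidth, which is where a fast KDE implementation matters.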

Computing KDE
Naive approach: EP-KDE
for each eval_point e in E
    for each sample s in S
        d = distance(e,s)
        e += density(d)
Complexity: O(|E|·|S|)
Our proposal: S-KDE
for each sample s in S
    B = findInfluenceArea(s)
    for each eval_point e in B
        d = distance(e,s)
        e += density(d)
Complexity: O(|B|·|S|)
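
The two loops above can be sketched in Python on a regular 2D grid. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses a scalar bandwidth h, a radial Epanechnikov kernel, and hypothetical function and parameter names. Both functions return the accumulated kernel contributions (dividing by the number of samples would give the density) together with the number of distance-density evaluations performed.

```python
import math

def density(dist, h):
    """2D radial Epanechnikov kernel, zero beyond distance h."""
    u2 = (dist / h) ** 2
    return (2.0 / math.pi) * (1.0 - u2) / (h * h) if u2 <= 1.0 else 0.0

def ep_kde(samples, h, origin, step, shape):
    """Naive approach: every evaluation point visits every sample."""
    nx, ny = shape
    grid = [[0.0] * ny for _ in range(nx)]
    for i in range(nx):
        for j in range(ny):
            ex, ey = origin[0] + i * step, origin[1] + j * step
            for sx, sy in samples:
                grid[i][j] += density(math.hypot(ex - sx, ey - sy), h)
    return grid, nx * ny * len(samples)

def s_kde(samples, h, origin, step, shape):
    """Sample-centred approach: each sample visits only its bounding box."""
    nx, ny = shape
    grid = [[0.0] * ny for _ in range(nx)]
    evaluations = 0
    for sx, sy in samples:
        # Index range of the axis-aligned box [s - h, s + h], clipped to the grid.
        i_lo = max(0, math.ceil((sx - h - origin[0]) / step))
        i_hi = min(nx - 1, math.floor((sx + h - origin[0]) / step))
        j_lo = max(0, math.ceil((sy - h - origin[1]) / step))
        j_hi = min(ny - 1, math.floor((sy + h - origin[1]) / step))
        for i in range(i_lo, i_hi + 1):
            for j in range(j_lo, j_hi + 1):
                ex, ey = origin[0] + i * step, origin[1] + j * step
                grid[i][j] += density(math.hypot(ex - sx, ey - sy), h)
                evaluations += 1
    return grid, evaluations
```

Because the kernel is zero outside the box, both functions produce the same grid up to rounding, while the second performs far fewer evaluations when the box is small relative to the evaluation space.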

Delimiting the influence area
• Depends on the Kernel
• Our case: Epanechnikov kernel
• Technique based on a method in Fukunaga, 1990
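
The slides do not detail the technique. As an illustration only, assume the influence area of a sample s is the ellipsoid (x − s)ᵀ H⁻¹ (x − s) ≤ 1. Its tightest axis-aligned bounding box then has half-width sqrt(H_ii) along axis i, so the box can be computed from the diagonal of H alone. The function name below is hypothetical.

```python
import math

def bounding_box(sample, H):
    """Axis-aligned box enclosing {x : (x - s)^T H^-1 (x - s) <= 1}."""
    half_widths = [math.sqrt(H[i][i]) for i in range(len(sample))]
    lower = [s - w for s, w in zip(sample, half_widths)]
    upper = [s + w for s, w in zip(sample, half_widths)]
    return lower, upper
```

With a different convention for the bandwidth matrix (for example, H playing the role of a standard deviation rather than a covariance), the half-widths would change accordingly.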

Chop & Crop
• In spaces of dimensionality 3 and higher, the number of
evaluation points outside the influence area increases
• We developed a technique to further reduce evaluations:
Step 1: Chop the box into slices
Step 2: Crop the slice

Example numbers
• 3D dataset: 500k samples
• Evaluation space: 194M points
• EP-KDE: 9.74 × 10^13 distance-density evaluations
• S-KDE: 102461 evaluation points per bounding box (average), 5.12 × 10^10 evaluations
• S-KDE + C&C: 53511 evaluation points per bounding box (average), 2.67 × 10^10 evaluations
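
The S-KDE totals follow directly from the number of samples multiplied by the average number of evaluation points per bounding box. The short check below reproduces that arithmetic and derives the reduction factors relative to EP-KDE; the variable names are illustrative, and the EP-KDE total is taken as quoted.

```python
samples = 500_000
per_box = 102_461      # average evaluation points per bounding box
per_box_cc = 53_511    # same, after Chop & Crop

s_kde_evals = samples * per_box        # 51_230_500_000, about 5.12e10
s_kde_cc_evals = samples * per_box_cc  # 26_755_500_000, about 2.67e10
ep_kde_evals = 9.74e13                 # total quoted for EP-KDE

reduction_s_kde = ep_kde_evals / s_kde_evals  # roughly 1900 times fewer evaluations
reduction_cc = ep_kde_evals / s_kde_cc_evals  # roughly 3640 times fewer evaluations
crop_ratio = per_box_cc / per_box             # about 0.52
```

The crop ratio of about 0.52 is close to π/6 ≈ 0.524, the volume of a sphere relative to its bounding cube. That would be consistent with Chop & Crop leaving mostly the points inside the influence area, although the slides do not state this explicitly.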

S-KDE in OpenMP
• Initialization
• Distribute samples to threads
• Fit bounding box
• Chop into slices
• Crop and compute density
• Accumulate density to evaluation space
• Directives used: #pragma omp for, #pragma simd, #pragma omp atomic

S-KDE in OpenMP
• Targeting Multi and Many core processors
• Tested platforms:
• Intel Core i7 3820 CPU (4 Cores @ 3.6 GHz)
• Intel Xeon Phi 3120A (57 Cores @ 1.1 GHz, Native mode)
• Public KDE implementations used as yardsticks:
• Ks-kde (R Package)
• GPUML
• Several Python libraries

Execution time comparison

Conclusions
1) S-KDE + Chop & Crop reduces KDE complexity
2) Native, parallel implementation for Multi and Many-core
processors
• OpenMP
3) We beat state-of-the-art alternatives

Implementation of S-KDE in
General Purpose Coprocessors

S-KDE in OpenCL
(1) Initialization
(2) Fit box & Chop
(3) Crop
(4) Offset calculation
(5) Density computation
(6) Density transfer
(7) Density accumulation
(Diagram legend: stages are marked as host code or accelerator code. Example values shown in the diagram: ...8 10 11 12 12 and ...0 8 18 29 41)
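
The two rows of example values in the diagram (8 10 11 12 12 and 0 8 18 29 41) are consistent with an exclusive prefix sum: each offset is the sum of all preceding counts. This is a common way to tell each work item where to write its results in a shared, densely packed buffer. Whether the offset calculation stage works exactly this way is an assumption, and the function name is hypothetical.

```python
def exclusive_scan(counts):
    """Start offset of each block when blocks are stored back to back."""
    offsets, running = [], 0
    for count in counts:
        offsets.append(running)
        running += count
    return offsets

counts = [8, 10, 11, 12, 12]
offsets = exclusive_scan(counts)  # [0, 8, 18, 29, 41]
```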

Execution time comparison

Conclusions
1) OpenCL version of S-KDE provides good overall
performance
2) The consolidation stage is the main bottleneck
3) The code is close to the limits of the accelerators
4) Further performance gains using pipelined execution

A Methodology for Environmental
Model Evaluation based on S-KDE

Climate models
• Mathematical representations of a climate system, based
on physical, chemical and biological principles
• They predict long-term trends
• Recently used to assess the impact of greenhouse gases

Climate model evaluation
• Models must be validated against actual observations
• There is no universally accepted validation strategy
• Popular approaches:
• Averaged values per estimated variable
• Evaluating the per-variable Probability Density Functions (PDFs)

PDF-based model evaluation
• Current approaches:
1) Compute the PDF per estimated variable
2) Calculate similarity score per-variable against observations
3) Combine the scores to get global performance of the model
• Lack of a universally accepted way to combine the scores
• Our proposal:
• An extension of the score by [1] to multiple dimensions
• A methodology to evaluate multiple variables in a single step
[1]: Perkins, S. E., et al. "Evaluation of the AR4 climate models' simulated daily maximum
temperature, minimum temperature, and precipitation over Australia using probability density
functions." Journal of climate 20.17 (2007): 4356-4376.
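
The score by Perkins et al. measures the overlap between two PDFs as the sum, over all bins, of the smaller of the two values. A natural multi-dimensional analogue, assumed here because the slides do not give the formula, applies the same sum to every cell of a shared evaluation grid and scales it by the cell volume. The function name and the flattened-list representation are illustrative choices.

```python
def overlap_score(pdf_model, pdf_obs, cell_volume=1.0):
    """Sum of the cell-wise minimum of two PDFs sampled on the same grid.

    Both inputs are flat lists of density values, one per grid cell.
    The result is 1.0 for identical PDFs and 0.0 for disjoint ones.
    """
    if len(pdf_model) != len(pdf_obs):
        raise ValueError("both PDFs must be evaluated on the same grid")
    return sum(min(m, o) for m, o in zip(pdf_model, pdf_obs)) * cell_volume
```

Because the grid can have any number of dimensions once flattened, one score covers all variables at once instead of combining per-variable scores.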

Methodology
1) Estimate optimal bandwidth (iterative use of KDE)
• Estimations (MIROC3.2-MR model): h = 0.6
• Observations: h = 0.65
2) Compute PDF with optimal bandwidth (single use of KDE)
• PDF (Estimations), h = 0.6
• PDF (Observations), h = 0.65
3) Compute score: S = 0.74

Evaluation
• Models: 7 from CMIP3 experiment (with different configurations)
• Dataset: 20C3M (1961 to 1998 on a daily basis)
• Variables:
• Global average of surface temperature
• Difference in temperature between N and S hemispheres
• Difference in temperature between Equator and the poles
• Scores for the models:
• NCEP: 0.82
• MIROC3.2-MR-2: 0.74
• MIROC3.2-MR-3: 0.73
• HADGEM1: 0.71
• MIROC3.2-MR-1: 0.70
• MIROC3.2-HR: 0.67
• GFDL-CM2.1: 0.62
• GFDL-CM2.0: 0.60
• BCM2.0: 0.51
• ECHAM5: 0.48
• MRI-RUN03: 0.30
• MRI-RUN04: 0.29
• MRI-RUN01: 0.29
• MRI-RUN02: 0.29
• MRI-RUN05: 0.28

Evaluation
(Figure: PDFs of observations (surface) and model (contour) for two models: MIROC3.2-MR-RUN02, score = 0.74, and MRI-RUN01, score = 0.29. Axes: C1 (K) and C2 (K). C0: global average surface temperature. C1: difference in temperature between hemispheres)

Conclusions
1) We have presented a methodology based on the
extension to multiple dimensions of the index by Perkins et al.
2) It allows evaluating multiple variables of an environmental
model in a single step
3) It is feasible in time thanks to the use of a fast
implementation of KDE: S-KDE

Summary of contributions
• We have conducted an extensive survey on performance
models using a proposed taxonomy
• We have designed S-KDE, a technique that reduces the
complexity of Kernel Density Estimation computations
• We have implemented S-KDE for Multi and Many-cores
using OpenMP
• Outperforming the state-of-the-art parallel codes for KDE

Summary of contributions
• We have presented an OpenCL implementation of S-KDE
for general purpose coprocessors
• It reaches the limits of the devices and achieves acceptable
performance, but requires further work
• We have designed a methodology for environmental
model evaluation based on KDE that allows evaluating
multiple variables from a model accurately in a simple
way
• S-KDE is a key, enabling element

Future work
• We intend to develop a methodology for the performance
evaluation of accelerator-based applications, based on the
survey presented as first contribution
• We need to improve S-KDE in both multi-cores and
coprocessors
• In particular, the consolidation stage
• We intend to design a technique to analyse new climate
data from the CMIP Project, with dimensionality up to ten

Publications
Unai Lopez-Novoa, Alexander Mendiburu, and Jose Miguel-Alonso.
A survey of performance modeling and simulation techniques for
accelerator-based computing. IEEE Transactions on Parallel and
Distributed Systems, 26(1):272–281, Jan 2015
Unai Lopez-Novoa, Jon Sáenz, Alexander Mendiburu, and Jose
Miguel-Alonso. An efficient implementation of kernel density
estimation for multi-core & many-core architectures. International
Journal of High Performance Computing Applications, Accepted,
2015, DOI: 10.1177/1094342015576813

Publications
Unai Lopez-Novoa, Alexander Mendiburu, and Jose Miguel-
Alonso. Kernel density estimation in accelerators: Implementation
and performance evaluation. Parallel Computing. To be
submitted.
Unai Lopez-Novoa, Jon Sáenz, Alexander Mendiburu, Jose
Miguel-Alonso, Iñigo Errasti, Ganix Esnaola, Agustín Ezcurra, and
Gabriel Ibarra-Berastegi. Multi-objective environmental model
evaluation by means of multidimensional kernel density
estimators: Efficient and multi-core implementations.
Environmental Modelling & Software, 63:123–136, 2015
