Unai Lopez Novoa
19 June 2015
PhD Dissertation
Advisors: Jose Miguel-Alonso & Alexander Mendiburu
Contributions to the Efficient Use of
General Purpose Coprocessors:
Kernel Density Estimation as Case Study
Outline
• Introduction
• Contributions
1) A Survey of Performance Modeling and Simulation Techniques
2) S-KDE: An Efficient Algorithm for Kernel Density Estimation
• And its implementation for Multi and Many-cores
3) Implementation of S-KDE in General Purpose Coprocessors
4) A Methodology for Environmental Model Evaluation based on S-KDE
• Conclusions
Introduction
High Performance Computing
• Branch of computer science related to the use of parallel
architectures to solve complex computational problems
• Today’s fastest supercomputer: Tianhe-2
(China's National University of Defense Technology, 33.86 PFLOP/s)
HPC Environments
• Traditional HPC systems were homogeneous, built
around single or multi-core CPUs
• But supercomputers are becoming heterogeneous
(Coprocessor number evolution in the Top500 list over time)
Compute platforms
Device: features (peak D.P. performance)
• Multi-Core CPUs: branch prediction, OoOE, “versatile” (up to 250 GFLOP/s)
• Graphics Processing Units: hundreds of cores, handle thousands of threads (up to 1.8 TFLOP/s)
• Many-Core Processors: tens of x86 cores, HyperThreading (up to 1 TFLOP/s)
Motivation
• Examples of successful porting of applications to
accelerators (compared against multi-core implementations):
• SAXPY: 11.8x
• Polynomial Equation Solver: 79x
• Image Treatment (MRI): 263x
• …
• … but this is not applicable for every HPC code
Ryoo, Shane, et al. "Optimization principles and application performance evaluation of a
multithreaded GPU using CUDA." Proceedings of the 13th ACM SIGPLAN Symposium on Principles
and practice of parallel programming. ACM, 2008. (>700 cites on Google Scholar)
Difficulties using accelerators
• Suitable codes for accelerators should:
• Expose high levels of parallelism
• Have a good spatial/temporal data locality
• …
• Porting a code requires extensive program rewriting
• Development tools for accelerators are not as polished
as those for CPUs
Effectively exploiting the performance of a
coprocessor remains a challenging task
Structure of this thesis
Motivation: discuss the issues in efficiently using general purpose coprocessors
• A survey of performance modeling and simulation techniques
Case study: Kernel Density Estimation applied to environmental model evaluation
• Design of a novel algorithm for Kernel Density Estimation: S-KDE
• S-KDE for Multi & Many-Cores
• S-KDE for Accelerators
• A methodology for environmental model evaluation based on S-KDE
A Survey of Performance Modeling
and Simulation Techniques
Developing for accelerators
• Approaches/aids:
• Trial and error
• Profilers / Debuggers / …
• Performance Models
A survey of models and simulators
• Accelerator & GPGPU trend began ~2005
• First performance models appeared ~2007
• Abundant literature
• No outstanding models or tools
Taxonomy
• Execution time estimation
• Bottleneck highlighting
• Power consumption estimation
• Simulators
Model analysis
• We analysed 29 relevant accelerator models
• For each of them we summarized and identified:
• Modeling method (Analytical, Machine Learning,…)
• Target platforms and test devices
• Input preprocessing requirements
• Limitations
• Highlights over other models
The MWP-CWP model
• Presented by Hong & Kim in 2009 (>360 cites in Google Scholar)
• Estimates the execution time of a GPU application
• Based on how Warps are scheduled in NVIDIA GPUs
• Method: Analytical
• Test platform: NVIDIA GPUs (8800GT,…)
• Input requirements: run µbenchmarks & parse PTX
• Limitations: branches not modeled
• Highlights: extendable to non-NVIDIA GPUs
The Roofline model
• Presented by Williams et al. in 2009 (>450 cites in Google Scholar)
• Outstanding model for bottleneck highlighting
• Visual model:
• Method: Analytical
• Test platform: Multi-core CPUs & Accelerators
• Input requirements: run µbenchmarks & analyse application
• Limitations: depends on architecture
• Highlights: visual output to guide optimizations
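As a rough illustration of how the Roofline model highlights bottlenecks, the sketch below computes the attainable performance as the minimum of the compute peak and the memory bandwidth times the arithmetic intensity. The 1.8 TFLOP/s peak is the GPU figure quoted in the compute platforms slide; the 250 GB/s bandwidth and the function name `roofline` are assumptions chosen only for the example.

```python
def roofline(peak_gflops, bandwidth_gbs, arithmetic_intensity):
    """Attainable GFLOP/s for a kernel with the given FLOP/byte ratio."""
    return min(peak_gflops, bandwidth_gbs * arithmetic_intensity)


PEAK = 1800.0      # GFLOP/s, from the 1.8 TFLOP/s GPU peak quoted in the deck
BANDWIDTH = 250.0  # GB/s, hypothetical value for illustration only

# Arithmetic intensity at which a kernel stops being memory-bound.
ridge = PEAK / BANDWIDTH

for intensity in (0.125, 1.0, ridge, 100.0):
    bound = "memory-bound" if intensity < ridge else "compute-bound"
    print(f"{intensity:8.3f} FLOP/byte -> "
          f"{roofline(PEAK, BANDWIDTH, intensity):7.1f} GFLOP/s ({bound})")
```

Kernels to the left of the ridge point are limited by memory traffic, so optimizations there should target data movement rather than arithmetic.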
Performance tools
• Some models require running performance tools
(µbenchmarks, profilers,…)
• We have reviewed them as well
Conclusions
1) There is no accurate model valid for a wide set of
architectures
2) Most models are tied to CUDA
3) There is a growing interest in analyzing power
4) It was impossible to make a comparison of the models
(lack of details, codes, …)
S-KDE: An Efficient Algorithm for
Kernel Density Estimation
(and its implementation for Multi and Many-cores)
Case study
• Collaborative work: EOLO, UPV/EHU Climate and Meteorology Group
• Scenario: Environmental Model Evaluation
• Problem: Excessive execution times of KDE
Kernel Density Estimation
• Statistical technique used to estimate the Probability
Density Function (PDF) of a random variable with
unknown characteristics
• where:
• x_i are the samples of the random variable
• K is the Kernel function
• H is the bandwidth value
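For reference, a standard multivariate form of the estimator that matches the symbols listed above is given below. It is an assumption here, since the slide's own notation may differ; n denotes the number of samples and is a symbol introduced for this sketch.

```latex
\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} K_H(x - x_i),
\qquad
K_H(u) = |H|^{-1/2} \, K\!\left(H^{-1/2} u\right)
```

Under this convention every sample x_i contributes one scaled copy of the kernel K, and H controls how far that contribution spreads.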
Kernel function
• Symmetric function that integrates to one
• We classify them according to their area of influence
(Figure: density vs. x for two example kernels. Gaussian: unbounded, plotted for x from -3 to 3. Epanechnikov: bounded, plotted for x from -1 to 1.)
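The two kernel families can be sketched in one dimension as follows. This is a minimal illustration using the textbook normalized forms, which may be scaled differently from the curves plotted on the slide; the helper `integrate` is a hypothetical name used only to check that each kernel integrates to one.

```python
import math


def gaussian_kernel(u):
    """Unbounded kernel: strictly positive for every u."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def epanechnikov_kernel(u):
    """Bounded kernel: exactly zero outside |u| <= 1."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def integrate(kernel, lo, hi, steps=20000):
    """Midpoint-rule integral, used only to check the unit-area property."""
    dx = (hi - lo) / steps
    return sum(kernel(lo + (k + 0.5) * dx) for k in range(steps)) * dx


print(integrate(gaussian_kernel, -8.0, 8.0))
print(integrate(epanechnikov_kernel, -1.0, 1.0))
```

The bounded support is what matters for the rest of the deck: a sample can only affect evaluation points that fall inside the region where the kernel is non-zero.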
Bandwidth
• Parameter to control the smoothness of the estimation
• It must be carefully selected
• Common approaches for its selection
• Heuristic as in Silverman, 1986
• Iterative technique, e.g., bootstrapping
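As an example of the heuristic route, the sketch below implements the one-dimensional rule of thumb from Silverman (1986), h = 0.9 · min(σ, IQR/1.34) · n^(-1/5). The slides do not say which of Silverman's heuristics was used, so this particular rule and the function name `silverman_bandwidth` are assumptions.

```python
import statistics


def silverman_bandwidth(samples):
    """Rule-of-thumb bandwidth for a 1D sample (Silverman, 1986)."""
    n = len(samples)
    std = statistics.stdev(samples)
    q1, _, q3 = statistics.quantiles(samples, n=4)
    iqr = q3 - q1
    # Use the more robust of the two spread estimates when the IQR is usable.
    spread = min(std, iqr / 1.34) if iqr > 0 else std
    return 0.9 * spread * n ** (-1 / 5)


data = [1.2, 0.7, 3.4, 2.2, 2.9, 1.8, 2.5, 0.9]
print(silverman_bandwidth(data))
```

A heuristic like this costs a single pass over the data, whereas iterative selection evaluates the density many times, which is where a fast KDE implementation pays off.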
Computing KDE
Naive approach: EP-KDE
for each eval_point e in E
    for each sample s in S
        d = distance(e, s)
        pdf[e] += density(d)
Complexity: O(|E|·|S|)
Our proposal: S-KDE
for each sample s in S
    B = findInfluenceArea(s)
    for each eval_point e in B
        d = distance(e, s)
        pdf[e] += density(d)
Complexity: O(|B|·|S|)
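The two loop orders can be compared with a small one-dimensional sketch. It assumes a regular evaluation grid, a scalar bandwidth h and the Epanechnikov kernel, so the influence area of a sample s is simply the interval [s - h, s + h]; the function names are illustrative and not taken from the thesis code.

```python
import math


def epanechnikov(u):
    """Epanechnikov kernel: non-zero only for |u| <= 1."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def ep_kde(samples, grid, h):
    """EP-KDE: every evaluation point visits every sample, O(|E|*|S|)."""
    n = len(samples)
    pdf = [0.0] * len(grid)
    for j, e in enumerate(grid):
        for s in samples:
            pdf[j] += epanechnikov((e - s) / h) / (n * h)
    return pdf


def s_kde(samples, x0, dx, m, h):
    """S-KDE: every sample visits only the grid points it can influence."""
    n = len(samples)
    pdf = [0.0] * m
    for s in samples:
        # Influence area of s on the grid x0 + j*dx, clipped to [0, m-1].
        lo = max(0, math.ceil((s - h - x0) / dx))
        hi = min(m - 1, math.floor((s + h - x0) / dx))
        for j in range(lo, hi + 1):
            e = x0 + j * dx
            pdf[j] += epanechnikov((e - s) / h) / (n * h)
    return pdf
```

Both functions return the same density up to rounding; the sample-driven version just skips the evaluations that would have contributed zero.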
Delimiting the influence area
• Depends on the Kernel
• Our case: Epanechnikov kernel
• Technique based on a method in Fukunaga, 1990
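One simple way to bound the influence area is sketched below. It assumes the kernel is non-zero only where (x - s)ᵀ H⁻¹ (x - s) ≤ 1 for a symmetric positive-definite bandwidth matrix H, in which case the axis-aligned bounding box of that ellipsoid has half-width sqrt(H[i][i]) along axis i. This is a generic geometric fact and not necessarily the exact procedure derived from Fukunaga (1990); `influence_box` is a hypothetical name.

```python
import math


def influence_box(sample, H):
    """Axis-aligned box enclosing {x : (x - s)^T H^-1 (x - s) <= 1}."""
    half = [math.sqrt(H[i][i]) for i in range(len(sample))]
    lower = [s - w for s, w in zip(sample, half)]
    upper = [s + w for s, w in zip(sample, half)]
    return lower, upper


# Example with a correlated 2D bandwidth matrix (values are illustrative).
lower, upper = influence_box([1.0, 2.0], [[4.0, 1.0], [1.0, 9.0]])
print(lower, upper)
```

Only evaluation points inside this box need a distance and density computation for the given sample.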
Chop & Crop
• In spaces of dimensionality 3 and higher, the number of
evaluation points outside the influence area increases
• We developed a technique to further reduce evaluations:
Step 1: Chop the box into slices
Step 2: Crop the slice
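The idea can be illustrated in two dimensions with a circular influence area of radius h on a grid of spacing dx. This is a simplification for brevity: the technique on the slide targets three or more dimensions and general influence areas, and the function names below are made up for the example.

```python
import math


def box_points(h, dx):
    """Grid offsets (i, j) covering the square bounding box [-h, h]^2."""
    k = math.floor(h / dx)
    return [(i, j) for i in range(-k, k + 1) for j in range(-k, k + 1)]


def chop_and_crop_points(h, dx):
    """Chop the box into rows, then crop each row to the disc of radius h."""
    k = math.floor(h / dx)
    points = []
    for i in range(-k, k + 1):  # chop: one slice per grid row
        y = i * dx
        # crop: half-length of the disc's chord in this row
        half_width = math.sqrt(max(0.0, h * h - y * y))
        kj = math.floor(half_width / dx)
        points.extend((i, j) for j in range(-kj, kj + 1))
    return points


print(len(box_points(1.0, 0.25)), len(chop_and_crop_points(1.0, 0.25)))
```

The share of box points discarded by cropping grows with the dimensionality, which is consistent with the slide's remark about spaces of dimensionality 3 and higher.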
Example numbers
• Dataset: 500k samples, 3D
• Evaluation space: 194M evaluation points
• EP-KDE: 9.74 × 10^13 distance-density evaluations
• S-KDE: 102461 evaluation points per bounding box (average), 5.12 × 10^10 evaluations
• S-KDE + C&C: 53511 evaluation points per bounding box (average), 2.67 × 10^10 evaluations
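The evaluation counts follow from multiplying the number of samples by the number of evaluation points each sample touches. The short check below redoes that arithmetic from the figures on the slide; the variable names are illustrative. Because "194M" is a rounded figure, the EP-KDE product comes out slightly below the 9.74 × 10^13 quoted on the slide.

```python
samples = 500_000
eval_points = 194_000_000   # "194M", rounded on the slide
box_points_avg = 102_461    # average evaluation points per bounding box
cropped_points_avg = 53_511 # average after Chop & Crop

ep_kde_evals = samples * eval_points
s_kde_evals = samples * box_points_avg
s_kde_cc_evals = samples * cropped_points_avg

print(f"EP-KDE:      {ep_kde_evals:.2e}")
print(f"S-KDE:       {s_kde_evals:.2e}")
print(f"S-KDE + C&C: {s_kde_cc_evals:.2e}")
print(f"Reduction with C&C: about {ep_kde_evals / s_kde_cc_evals:.0f}x")
```

With these inputs, S-KDE with Chop & Crop performs between three and four thousand times fewer distance-density evaluations than the naive approach.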
S-KDE in OpenMP
• Initialization
• Distribute samples to threads (#pragma omp for)
• Fit bounding box
• Chop into slices
• Crop and compute density (#pragma simd)
• Accumulate density to evaluation space (#pragma omp atomic)
S-KDE in OpenMP
• Targeting Multi and Many-core processors
• Tested platforms:
• Intel Core i7 3820 CPU (4 Cores @ 3.6 GHz)
• Intel Xeon Phi 3120A (57 Cores @ 1.1 GHz, Native mode)
• Public KDE implementations used as yardsticks:
• Ks-kde (R Package)
• GPUML
• Several Python libraries
Execution time comparison
Conclusions
1) S-KDE + Chop & Crop reduces KDE complexity
2) Native, parallel implementation for Multi and Many-core processors
• OpenMP
3) We beat state-of-the-art alternatives
Implementation of S-KDE in
General Purpose Coprocessors
S-KDE in OpenCL
(1) Initialization
(2) Fit box & Chop
(3) Crop
(4) Offset calculation (example: ...8 10 11 12 12 → ...0 8 18 29 41)
(5) Density computation
(6) Density transfer
(7) Density accumulation
• Each stage runs as either host code or accelerator code
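The two rows of numbers on this slide (8 10 11 12 12 and 0 8 18 29 41) are consistent with an exclusive prefix sum, in which each block's offset is the total size of the blocks before it. Reading the first row as per-block sizes and the second as output offsets is an interpretation of the figure, and `exclusive_scan` is a hypothetical helper name.

```python
from itertools import accumulate


def exclusive_scan(counts):
    """Offset of each block = sum of the sizes of all blocks before it."""
    if not counts:
        return []
    return list(accumulate(counts[:-1], initial=0))


counts = [8, 10, 11, 12, 12]      # sizes taken from the slide
offsets = exclusive_scan(counts)  # where each block starts in a flat buffer
print(offsets)
```

With the offsets known in advance, each work item can write its densities to a disjoint region of one contiguous buffer without synchronization.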
Execution time comparison
Conclusions
1) OpenCL version of S-KDE provides good overall
performance
2) The consolidation stage is the main bottleneck
3) The code is close to the limits of the accelerators
4) Further performance gains using pipelined execution
A Methodology for Environmental
Model Evaluation based on S-KDE
Climate models
• Mathematical representations of a climate system, based
on physical, chemical and biological principles
• They predict long-term trends
• Recently used to assess the impact of greenhouse gases
Climate model evaluation
• Models must be validated against actual observations
• There is no universally accepted validation strategy
• Popular approaches:
• Averaged values per estimated variable
• Evaluating the per-variable Probability Density Functions (PDFs)
PDF-based model evaluation
• Current approaches:
1) Compute the PDF per estimated variable
2) Calculate similarity score per-variable against observations
3) Combine the scores to get global performance of the model
• Lack of a universally accepted way to combine the scores
• Our proposal:
• An extension of the score by [1] to multiple dimensions
• A methodology to evaluate multiple variables in a single step
[1]: Perkins, S. E., et al. "Evaluation of the AR4 climate models' simulated daily maximum
temperature, minimum temperature, and precipitation over Australia using probability density
functions." Journal of Climate 20.17 (2007): 4356-4376.
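The score by Perkins et al. [1] measures the overlap between two PDFs as the sum, over bins, of the smaller of the two probabilities. A natural multidimensional reading, sketched here as an assumption about how the extension works, applies the same sum to the cells of a common evaluation grid; the function name and the `cell_volume` parameter are illustrative.

```python
def overlap_score(pdf_model, pdf_obs, cell_volume=1.0):
    """Sum over grid cells of the smaller of the two densities, times cell volume.

    Both PDFs are given as flat sequences over the same evaluation grid,
    whatever its dimensionality. 1.0 means identical PDFs, 0.0 no overlap.
    """
    return sum(min(m, o) for m, o in zip(pdf_model, pdf_obs)) * cell_volume


obs = [0.1, 0.4, 0.4, 0.1]    # toy probabilities per cell, summing to one
model = [0.2, 0.3, 0.3, 0.2]
print(overlap_score(model, obs))
```

Because the grid is flattened before the sum, the same function covers one variable or several jointly, which is what lets a single score summarize multiple variables.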
Methodology
1) Estimate optimal bandwidth (iterative use of KDE)
• Estimations (MIROC3.2-MR model): h = 0.6
• Observations: h = 0.65
2) Compute PDF with opt. bandwidth (single use of KDE)
• PDF (Estimations) with h = 0.6
• PDF (Observations) with h = 0.65
3) Compute score: S = 0.74
Evaluation
• Models: 7 from CMIP3 experiment (with different configurations)
• Dataset: 20C3M (1961 to 1998 on a daily basis)
• Variables:
• Global average of surface temperature
• Difference in temperature between N and S hemispheres
• Difference in temperature between Equator and the poles
• Scores for the models:
• NCEP: 0.82
• MIROC3.2MR-2: 0.74
• MIROC3.2MR-3: 0.73
• HADGEM1: 0.71
• MIROC3.2MR-1: 0.70
• MIROC3.2-HR: 0.67
• GFDL-CM2.1: 0.62
• GFDL-CM2.0: 0.60
• BCM2.0: 0.51
• ECHAM5: 0.48
• MRI-RUN03: 0.30
• MRI-RUN04: 0.29
• MRI-RUN01: 0.29
• MRI-RUN02: 0.29
• MRI-RUN05: 0.28
Evaluation
(Figure: observations shown as a surface and model as contours, for MIROC3.2-MR-RUN02 (score = 0.74) and MRI-RUN01 (score = 0.29). C0: global average surface temperature; C1: difference in temperature between hemispheres. Axes: C1 (K), C2 (K).)
Conclusions
1) We have presented a methodology based on the
extension to multiple dimensions of the index by Perkins et al.
2) It allows evaluating multiple variables of an environmental
model in a single step
3) It is feasible in time thanks to the use of a fast
implementation of KDE: S-KDE
Conclusions
Summary of contributions
• We have conducted an extensive survey on performance
models using a proposed taxonomy
• We have designed S-KDE, a technique that reduces the
complexity of Kernel Density Estimation computations
• We have implemented S-KDE for Multi and Many-cores
using OpenMP
• Outperforming the state-of-the-art parallel codes for KDE
Summary of contributions
• We have presented an OpenCL implementation of S-KDE
for general purpose coprocessors.
• It reaches the limits of the devices with acceptable performance,
but requires further work
• We have designed a methodology for environmental
model evaluation based on KDE that allows evaluating
multiple variables from a model accurately and in a simple
way
• S-KDE is a key, enabling element
Future work
• We intend to develop a methodology for the performance
evaluation of accelerator-based applications, based on the
survey presented as first contribution
• We need to improve S-KDE in both multi-cores and
coprocessors
• In particular, the consolidation stage
• We intend to design a technique to analyse new climate
data from the CMIP Project, with dimensionality up to ten
Publications
Unai Lopez-Novoa, Alexander Mendiburu, and Jose Miguel-Alonso.
A survey of performance modeling and simulation techniques for
accelerator-based computing. IEEE Transactions on Parallel and
Distributed Systems, 26(1):272–281, Jan 2015
Unai Lopez-Novoa, Jon Sáenz, Alexander Mendiburu, and Jose
Miguel-Alonso. An efficient implementation of kernel density
estimation for multi-core & many-core architectures. International
Journal of High Performance Computing Applications, Accepted,
2015, DOI: 10.1177/1094342015576813
Publications
Unai Lopez-Novoa, Alexander Mendiburu, and Jose Miguel-
Alonso. Kernel density estimation in accelerators: Implementation
and performance evaluation. Parallel Computing. To be
submitted.
Unai Lopez-Novoa, Jon Sáenz, Alexander Mendiburu, Jose
Miguel-Alonso, Iñigo Errasti, Ganix Esnaola, Agustín Ezcurra, and
Gabriel Ibarra-Berastegi. Multi-objective environmental model
evaluation by means of multidimensional kernel density
estimators: Efficient and multi-core implementations.
Environmental Modelling & Software, 63:123 – 136, 2015
 
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...
 
A High-Level Programming Approach for using FPGAs in HPC using Functional Des...
A High-Level Programming Approach for using FPGAs in HPC using Functional Des...A High-Level Programming Approach for using FPGAs in HPC using Functional Des...
A High-Level Programming Approach for using FPGAs in HPC using Functional Des...
 

More from Unai Lopez-Novoa

Exploring performance and energy consumption differences between recent Intel...
Exploring performance and energy consumption differences between recent Intel...Exploring performance and energy consumption differences between recent Intel...
Exploring performance and energy consumption differences between recent Intel...Unai Lopez-Novoa
 
A Platform for Overcrowding Detection in Indoor Events using Scalable Technol...
A Platform for Overcrowding Detection in Indoor Events using Scalable Technol...A Platform for Overcrowding Detection in Indoor Events using Scalable Technol...
A Platform for Overcrowding Detection in Indoor Events using Scalable Technol...Unai Lopez-Novoa
 
Introducción a la Computación Paralela
Introducción a la Computación ParalelaIntroducción a la Computación Paralela
Introducción a la Computación ParalelaUnai Lopez-Novoa
 
Computación Heterogénea: Aplicaciones y Modelado de Rendimiento
Computación Heterogénea: Aplicaciones y Modelado de RendimientoComputación Heterogénea: Aplicaciones y Modelado de Rendimiento
Computación Heterogénea: Aplicaciones y Modelado de RendimientoUnai Lopez-Novoa
 
Tolerancia a fallos en MPI con Checkpointing
Tolerancia a fallos en MPI con CheckpointingTolerancia a fallos en MPI con Checkpointing
Tolerancia a fallos en MPI con CheckpointingUnai Lopez-Novoa
 

More from Unai Lopez-Novoa (8)

Exploring performance and energy consumption differences between recent Intel...
Exploring performance and energy consumption differences between recent Intel...Exploring performance and energy consumption differences between recent Intel...
Exploring performance and energy consumption differences between recent Intel...
 
A Platform for Overcrowding Detection in Indoor Events using Scalable Technol...
A Platform for Overcrowding Detection in Indoor Events using Scalable Technol...A Platform for Overcrowding Detection in Indoor Events using Scalable Technol...
A Platform for Overcrowding Detection in Indoor Events using Scalable Technol...
 
Introducción a la Computación Paralela
Introducción a la Computación ParalelaIntroducción a la Computación Paralela
Introducción a la Computación Paralela
 
Computación Heterogénea: Aplicaciones y Modelado de Rendimiento
Computación Heterogénea: Aplicaciones y Modelado de RendimientoComputación Heterogénea: Aplicaciones y Modelado de Rendimiento
Computación Heterogénea: Aplicaciones y Modelado de Rendimiento
 
Introduction to OpenCL
Introduction to OpenCLIntroduction to OpenCL
Introduction to OpenCL
 
Exploring Gpgpu Workloads
Exploring Gpgpu WorkloadsExploring Gpgpu Workloads
Exploring Gpgpu Workloads
 
Tolerancia a fallos en MPI con Checkpointing
Tolerancia a fallos en MPI con CheckpointingTolerancia a fallos en MPI con Checkpointing
Tolerancia a fallos en MPI con Checkpointing
 
Introduccion a MPI
Introduccion a MPIIntroduccion a MPI
Introduccion a MPI
 

Recently uploaded

HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptxHOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptxSCMS School of Architecture
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VDineshKumar4165
 
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...Amil baba
 
DC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equationDC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equationBhangaleSonal
 
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best ServiceTamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Servicemeghakumariji156
 
DeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakesDeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakesMayuraD1
 
Online food ordering system project report.pdf
Online food ordering system project report.pdfOnline food ordering system project report.pdf
Online food ordering system project report.pdfKamal Acharya
 
School management system project Report.pdf
School management system project Report.pdfSchool management system project Report.pdf
School management system project Report.pdfKamal Acharya
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"mphochane1998
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTbhaskargani46
 
Hostel management system project report..pdf
Hostel management system project report..pdfHostel management system project report..pdf
Hostel management system project report..pdfKamal Acharya
 
Verification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptxVerification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptxchumtiyababu
 
Introduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaIntroduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaOmar Fathy
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . pptDineshKumar4165
 
Employee leave management system project.
Employee leave management system project.Employee leave management system project.
Employee leave management system project.Kamal Acharya
 
Computer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to ComputersComputer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to ComputersMairaAshraf6
 
AIRCANVAS[1].pdf mini project for btech students
AIRCANVAS[1].pdf mini project for btech studentsAIRCANVAS[1].pdf mini project for btech students
AIRCANVAS[1].pdf mini project for btech studentsvanyagupta248
 
Moment Distribution Method For Btech Civil
Moment Distribution Method For Btech CivilMoment Distribution Method For Btech Civil
Moment Distribution Method For Btech CivilVinayVitekari
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXssuser89054b
 

Recently uploaded (20)

HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptxHOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
 
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
 
DC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equationDC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equation
 
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best ServiceTamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service
 
DeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakesDeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakes
 
Online food ordering system project report.pdf
Online food ordering system project report.pdfOnline food ordering system project report.pdf
Online food ordering system project report.pdf
 
School management system project Report.pdf
School management system project Report.pdfSchool management system project Report.pdf
School management system project Report.pdf
 
Integrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - NeometrixIntegrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - Neometrix
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
 
Hostel management system project report..pdf
Hostel management system project report..pdfHostel management system project report..pdf
Hostel management system project report..pdf
 
Verification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptxVerification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptx
 
Introduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaIntroduction to Serverless with AWS Lambda
Introduction to Serverless with AWS Lambda
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
Employee leave management system project.
Employee leave management system project.Employee leave management system project.
Employee leave management system project.
 
Computer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to ComputersComputer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to Computers
 
AIRCANVAS[1].pdf mini project for btech students
AIRCANVAS[1].pdf mini project for btech studentsAIRCANVAS[1].pdf mini project for btech students
AIRCANVAS[1].pdf mini project for btech students
 
Moment Distribution Method For Btech Civil
Moment Distribution Method For Btech CivilMoment Distribution Method For Btech Civil
Moment Distribution Method For Btech Civil
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 

Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

  • 1. Unai Lopez Novoa, 19 June 2015. PhD Dissertation. Advisors: Jose Miguel-Alonso & Alexander Mendiburu. Contributions to the Efficient Use of General Purpose Coprocessors: Kernel Density Estimation as Case Study
  • 2. Outline • Introduction • Contributions: 1) A Survey of Performance Modeling and Simulation Techniques 2) S-KDE: An Efficient Algorithm for Kernel Density Estimation (and its implementation for Multi and Many-cores) 3) Implementation of S-KDE in General Purpose Coprocessors 4) A Methodology for Environmental Model Evaluation based on S-KDE • Conclusions
  • 4. High Performance Computing • Branch of computer science related to the use of parallel architectures to solve complex computational problems • Today’s fastest supercomputer: Tianhe-2 (China’s National University of Defense Technology, 33.86 PFLOP/s)
  • 5. HPC Environments • Traditional HPC systems were homogeneous, built around single or multi-core CPUs • But supercomputers are becoming heterogeneous (Figure: evolution of the number of coprocessors in the Top500 list over time)
  • 6. Compute platforms (device: features; peak double-precision performance) • Multi-Core CPUs: branch prediction, out-of-order execution (OoOE), “versatile”; up to 250 GFLOP/s • Graphics Processing Units: hundreds of cores, handle thousands of threads; up to 1.8 TFLOP/s • Many-Core Processors: tens of x86 cores, HyperThreading; up to 1 TFLOP/s
  • 7. Motivation • Examples of successful porting of applications to accelerators (compared against multi-core implementations): • SAXPY: 11.8x • Polynomial Equation Solver: 79x • Image Treatment (MRI): 263x • … • … but this is not applicable to every HPC code • Source: Ryoo, Shane, et al. "Optimization principles and application performance evaluation of a multithreaded GPU using CUDA." Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, 2008. (>700 cites on Google Scholar)
  • 8. Difficulties using accelerators • Suitable codes for accelerators should: • Expose high levels of parallelism • Have good spatial/temporal data locality • … • Porting a code requires extensive program rewriting • Development tools for accelerators are not as polished as those for CPUs • Effectively exploiting the performance of a coprocessor remains a challenging task
  • 9. Structure of this thesis • Motivation (discuss the issues involved in using general purpose coprocessors efficiently): A survey of performance modeling and simulation techniques • Case Study (Kernel Density Estimation applied to environmental model evaluation): Design of a novel algorithm for Kernel Density Estimation, S-KDE; S-KDE for Multi & Many-Cores; S-KDE for Accelerators; A methodology for environmental model evaluation based on S-KDE
  • 10. A Survey of Performance Modeling and Simulation Techniques
  • 11. Developing for accelerators • Approaches/aids: • Trial and error • Profilers / Debuggers / … • Performance Models
  • 12. A survey of models and simulators • Accelerator & GPGPU trend began ~2005 • First performance models appeared ~2007 • Abundant literature • No outstanding models or tools
  • 13. Taxonomy • Execution time estimation • Bottleneck highlighting • Power consumption estimation • Simulators
  • 14. Model analysis • We analysed 29 relevant accelerator models • For each of them we identified and summarized: • Modeling method (Analytical, Machine Learning,…) • Target platforms and test devices • Input preprocessing requirements • Limitations • Highlights over other models
  • 15. The MWP-CWP model • Presented by Hong & Kim in 2009 (>360 cites in Google Scholar) • Estimates the execution time of a GPU application • Based on how warps are scheduled in NVIDIA GPUs • Method: Analytical • Test platform: NVIDIA GPUs (8800GT,…) • Input requirements: Run µbenchmarks & parse PTX • Limitations: Branches not modeled • Highlights: Extendable to non-NVIDIA GPUs
  • 16. The Roofline model • Presented by Williams et al. in 2009 (>450 cites in Google Scholar) • Outstanding model for bottleneck highlighting • Visual model • Method: Analytical • Test platform: Multi-core CPUs & Accelerators • Input requirements: Run µbenchmarks & analyse application • Limitations: Depends on architecture • Highlights: Visual output to guide optimizations
  • 17. Performance tools • Some models require running performance tools (µbenchmarks, profilers,…) • We have reviewed them as well
  • 18. Conclusions 1) There is no accurate model valid for a wide set of architectures 2) Most models are tied to CUDA 3) There is a growing interest in analyzing power 4) It was impossible to make a comparison of the models (lack of details, codes, …)
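As an illustration of the bottleneck-highlighting idea behind the Roofline model (slide 16), the following minimal sketch computes the attainable performance of a kernel as the lower of the compute ceiling and the memory ceiling. The device figures (`PEAK`, `BW`) and the function name `roofline` are hypothetical values chosen for the example, not numbers from the thesis.

```python
def roofline(peak_gflops: float, bandwidth_gbs: float, intensity: float) -> float:
    """Attainable GFLOP/s for a kernel with the given arithmetic intensity (FLOP/byte)."""
    # A kernel is limited either by the compute peak or by how fast memory can feed it.
    return min(peak_gflops, bandwidth_gbs * intensity)


# Hypothetical device: 1000 GFLOP/s peak, 200 GB/s memory bandwidth (assumed values).
PEAK, BW = 1000.0, 200.0
ridge = PEAK / BW  # intensity at which a kernel stops being memory-bound

for ai in (0.25, 1.0, 5.0, 16.0):
    bound = "memory" if ai < ridge else "compute"
    print(f"AI={ai:5.2f} FLOP/B -> {roofline(PEAK, BW, ai):7.1f} GFLOP/s ({bound}-bound)")
```

Kernels to the left of the ridge point would benefit from optimizations that reduce memory traffic, while kernels to the right are limited by arithmetic throughput.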
  • 19. S-KDE: An Efficient Algorithm for Kernel Density Estimation (and its implementation for Multi and Many-cores)
  • 20. Case study • Collaborative Work: EOLO UPV/EHU Climate and Meteorology Group • Scenario: Environmental Model Evaluation • Problem: Excessive execution times of KDE
  • 21. Kernel Density Estimation • Statistical technique used to estimate the Probability Density Function (PDF) of a random variable with unknown characteristics • Standard form of the estimator: $\hat{f}(x) = \frac{1}{n\,|H|^{1/2}} \sum_{i=1}^{n} K\left(H^{-1/2}(x - x_i)\right)$ • where: • $x_i$ are the $n$ samples of the random variable • $K$ is the Kernel function • $H$ is the bandwidth
  • 22. Kernel function • Symmetric function that integrates to one • We classify them according to their area of influence: bounded (e.g., Epanechnikov) or unbounded (e.g., Gaussian) (Figure: density vs. x for the Gaussian and Epanechnikov kernels)
  • 23. Bandwidth • Parameter to control the smoothness of the estimation • It must be carefully selected • Common approaches for its selection: • Heuristic, as in Silverman, 1986 • Iterative technique, e.g., bootstrapping
  • 24. Computing KDE
      Naive approach: EP-KDE. Complexity: O(|E|·|S|)
        for each eval_point e in E
          for each sample s in S
            d = distance(e, s)
            e += density(d)
      Our proposal: S-KDE. Complexity: O(|B|·|S|)
        for each sample s in S
          B = findInfluenceArea(s)
          for each eval_point e in B
            d = distance(e, s)
            e += density(d)
  • 25. Delimiting the influence area • Depends on the Kernel • Our case: Epanechnikov kernel • Technique based on a method in Fukunaga, 1990
  • 26. Chop & Crop • In spaces of dimensionality 3 and higher, the number of evaluation points outside the influence area increases • We developed a technique to further reduce evaluations: • Step 1: Chop the box into slices • Step 2: Crop the slice
  • 27. Example numbers • 500k samples, 3D dataset, 194M-point evaluation space • EP-KDE: 9.74 × 10^13 distance-density evaluations • S-KDE: 102461 evaluation points per bounding box (average), 5.12 × 10^10 evaluations • S-KDE + C&C: 53511 evaluation points per bounding box (average), 2.67 × 10^10 evaluations
  • 28. S-KDE in OpenMP • Steps: Initialization; Distribute samples to threads; Fit bounding box; Chop into slices; Crop and compute density; Accumulate density to evaluation space • Directives shown: #pragma omp for, #pragma simd, #pragma atomic
  • 29. S-KDE in OpenMP • Targeting Multi and Many-core processors • Tested platforms: • Intel Core i7 3820 CPU (4 Cores @ 3.6 GHz) • Intel Xeon Phi 3120A (57 Cores @ 1.1 GHz, Native mode) • Public KDE implementations used as yardsticks: • Ks-kde (R Package) • GPUML • Several Python libraries
  • 30. Execution time comparison (figure)
  • 31. Conclusions 1) S-KDE + Chop & Crop reduces KDE complexity 2) Native, parallel implementation for Multi and Many-core processors, using OpenMP 3) We beat state-of-the-art alternatives
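The difference between the two loop orders on slide 24 can be sketched in a few lines. The example below is a simplified illustration, not the thesis implementation: it assumes a 1D regular grid, a scalar bandwidth `h` and the 1D Epanechnikov kernel, so the "influence area" of a sample is just the interval [s - h, s + h]. Bounding boxes in higher dimensions and the Chop & Crop step are not modeled. All function names are made up for the example.

```python
import numpy as np


def epanechnikov(u):
    """1D Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u * u), 0.0)


def ep_kde(grid, samples, h):
    """Naive approach: every evaluation point visits every sample."""
    density = np.zeros_like(grid)
    evaluations = 0
    for j, e in enumerate(grid):
        density[j] = epanechnikov((e - samples) / h).sum()
        evaluations += samples.size
    return density / (samples.size * h), evaluations


def s_kde(grid, samples, h):
    """Sample-driven approach: each sample updates only grid points within distance h."""
    density = np.zeros_like(grid)
    evaluations = 0
    for s in samples:
        # The grid is sorted, so the influence interval maps to an index range.
        lo = np.searchsorted(grid, s - h, side="left")
        hi = np.searchsorted(grid, s + h, side="right")
        density[lo:hi] += epanechnikov((grid[lo:hi] - s) / h)
        evaluations += int(hi - lo)
    return density / (samples.size * h), evaluations


rng = np.random.default_rng(0)
samples = rng.normal(size=2000)
grid = np.linspace(-5.0, 5.0, 1001)
h = 0.3

d_ep, n_ep = ep_kde(grid, samples, h)
d_s, n_s = s_kde(grid, samples, h)
print("same estimate:", np.allclose(d_ep, d_s))
print("kernel evaluations, EP-KDE vs S-KDE:", n_ep, n_s)
```

Both functions return the same density because a bounded kernel contributes nothing outside its influence area; only the number of kernel evaluations changes. The 3D figures on slide 27 (9.74 × 10^13 versus 5.12 × 10^10 evaluations) reflect the same effect at scale.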
  • 32. Implementation of S-KDE in General Purpose Coprocessors
  • 33. S-KDE in OpenCL • Stages: (1) Initialization (2) Fit box & Chop (3) Crop (4) Offset calculation (5) Density computation (6) Density transfer (7) Density accumulation • Values shown in the figure: …8 10 11 12 12 and …0 8 18 29 41 • Legend: host code, accelerator code
  • 34. Execution time comparison (figure)
  • 35. Conclusions 1) OpenCL version of S-KDE provides good overall performance 2) The consolidation stage is the main bottleneck 3) The code is close to the limits of the accelerators 4) Further performance gains using pipelined execution
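The two number rows in the figure on slide 33 (…8 10 11 12 12 and …0 8 18 29 41) are consistent with an exclusive prefix sum, a common way to give each work item its own write position in a packed output buffer. The sketch below is an assumption about what the offset calculation stage does, based only on those numbers; `exclusive_scan` is a hypothetical helper, not a function from the thesis code.

```python
def exclusive_scan(counts):
    """Return the start offset of each item in a packed buffer, plus the total size."""
    offsets, running = [], 0
    for c in counts:
        offsets.append(running)  # this item starts where the previous ones end
        running += c
    return offsets, running


# Per-item counts taken from the figure on slide 33.
counts = [8, 10, 11, 12, 12]
offsets, total = exclusive_scan(counts)
print(offsets, total)
```

With offsets computed up front, each item can write its densities to a disjoint region of the buffer without synchronizing with the others.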
  • 36. A Methodology for Environmental Model Evaluation based on S-KDE
  • 37. Climate models • Mathematical representations of a climate system, based on physical, chemical and biological principles • They predict long-term trends • Recently used to assess the impact of greenhouse gases
  • 38. Climate model evaluation • Models must be validated against actual observations • There is no universally accepted validation strategy • Popular approaches: • Averaged values per estimated variable • Evaluating the per-variable Probability Density Functions (PDFs)
  • 39. PDF-based model evaluation • Current approaches: 1) Compute the PDF per estimated variable 2) Calculate a per-variable similarity score against observations 3) Combine the scores to get the global performance of the model • Lack of a universally accepted way to combine the scores • Our proposal: • An extension of the score by [1] to multiple dimensions • A methodology to evaluate multiple variables in a single step • [1]: Perkins, S. E., et al. "Evaluation of the AR4 climate models' simulated daily maximum temperature, minimum temperature, and precipitation over Australia using probability density functions." Journal of Climate 20.17 (2007): 4356-4376.
  • 40. Methodology (example: MIROC3.2-MR model estimations vs. observations) 1) Estimate the optimal bandwidth by iterative use of KDE (estimations: h = 0.6; observations: h = 0.65) 2) Compute each PDF with its optimal bandwidth, using KDE once 3) Compute the score (S = 0.74)
  • 41. Evaluation • Models: 7 from CMIP3 experiment (with different configurations) • Dataset: 20C3M (1961 to 1998 on a daily basis) • Variables: • Global average of surface temperature • Difference in temperature between N and S hemispheres • Difference in temperature between Equator and the poles • Scores for the models: NCEP 0.82; MIROC3.2MR-2 0.74; MIROC3.2MR-3 0.73; HADGEM1 0.71; MIROC3.2MR-1 0.70; MIROC3.2-HR 0.67; GFDL-CM2.1 0.62; GFDL-CM2.0 0.60; BCM2.0 0.51; ECHAM5 0.48; MRI-RUN03 0.30; MRI-RUN04 0.29; MRI-RUN01 0.29; MRI-RUN02 0.29; MRI-RUN05 0.28
  • 42. Evaluation • MIROC3.2-MR-RUN02: score = 0.74 • MRI-RUN01: score = 0.29 • Plot legend: surface = observations, contour = model • C0: Global average surface temperature • C1: Difference in temperature between hemispheres • Plot axes: C1 (K), C2 (K)
  • 43. Conclusions 1) We have presented a methodology based on the extension to multiple dimensions of the index by Perkins et al. 2) It allows evaluating multiple variables of an environmental model in a single step 3) It is feasible in time thanks to the use of a fast implementation of KDE: S-KDE
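The score by Perkins et al. referenced on slide 39 measures the overlap between two PDFs as the sum, over bins, of the smaller of the two probabilities. The sketch below shows how that idea carries over to several variables at once. It is only an illustration under stated assumptions: it uses multidimensional histograms (`numpy.histogramdd`) as a stand-in for the KDE-estimated PDFs used in the thesis, and the function name, bin count and synthetic data are made up for the example.

```python
import numpy as np


def overlap_score(model, obs, bins=20):
    """Sum over cells of min(P_model, P_obs): 1.0 for identical PDFs, 0.0 for disjoint ones."""
    data = np.vstack([model, obs])
    # Use a common binning for both datasets so their cells are comparable.
    ranges = list(zip(data.min(axis=0), data.max(axis=0)))
    p_model, _ = np.histogramdd(model, bins=bins, range=ranges)
    p_obs, _ = np.histogramdd(obs, bins=bins, range=ranges)
    p_model = p_model / p_model.sum()  # counts -> probabilities
    p_obs = p_obs / p_obs.sum()
    return float(np.minimum(p_model, p_obs).sum())


rng = np.random.default_rng(1)
observations = rng.normal(size=(5000, 3))                 # three variables per record
good_model = rng.normal(size=(5000, 3))                   # same distribution
biased_model = rng.normal(loc=1.5, size=(5000, 3))        # shifted in every variable

print("good model  :", overlap_score(good_model, observations))
print("biased model:", overlap_score(biased_model, observations))
```

Because the overlap is computed on the joint distribution, a single number summarizes all variables, which avoids having to combine per-variable scores afterwards.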
  • 45. Summary of contributions • We have conducted an extensive survey on performance models using a proposed taxonomy • We have designed S-KDE, a technique that reduces the complexity of Kernel Density Estimation computations • We have implemented S-KDE for Multi and Many-cores using OpenMP • Outperforming the state-of-the-art parallel codes for KDE
  • 46. Summary of contributions • We have presented an OpenCL implementation of S-KDE for general purpose coprocessors • It reaches the limits of the devices and achieves acceptable performance, but requires further work • We have designed a methodology for environmental model evaluation based on KDE that allows evaluating multiple variables from a model accurately and in a simple way • S-KDE is a key, enabling element
  • 47. Future work • We intend to develop a methodology for the performance evaluation of accelerator-based applications, based on the survey presented as first contribution • We need to improve S-KDE in both multi-cores and coprocessors • In particular, the consolidation stage • We intend to design a technique to analyse new climate data from the CMIP Project, with dimensionality up to ten
  • 48. Publications • Unai Lopez-Novoa, Alexander Mendiburu, and Jose Miguel-Alonso. A survey of performance modeling and simulation techniques for accelerator-based computing. IEEE Transactions on Parallel and Distributed Systems, 26(1):272-281, Jan 2015. • Unai Lopez-Novoa, Jon Sáenz, Alexander Mendiburu, and Jose Miguel-Alonso. An efficient implementation of kernel density estimation for multi-core & many-core architectures. International Journal of High Performance Computing Applications, accepted, 2015. DOI: 10.1177/1094342015576813
  • 49. Publications • Unai Lopez-Novoa, Alexander Mendiburu, and Jose Miguel-Alonso. Kernel density estimation in accelerators: Implementation and performance evaluation. Parallel Computing. To be submitted. • Unai Lopez-Novoa, Jon Sáenz, Alexander Mendiburu, Jose Miguel-Alonso, Iñigo Errasti, Ganix Esnaola, Agustín Ezcurra, and Gabriel Ibarra-Berastegi. Multi-objective environmental model evaluation by means of multidimensional kernel density estimators: Efficient and multi-core implementations. Environmental Modelling & Software, 63:123-136, 2015.
  • 50. Unai Lopez Novoa, 19 June 2015. PhD Dissertation. Advisors: Jose Miguel-Alonso & Alexander Mendiburu. Contributions to the Efficient Use of General Purpose Coprocessors: Kernel Density Estimation as Case Study