Harnessing OpenCL in
modern coprocessors
Unai Lopez-Novoa
unai.lopez@ehu.es
06 Aug 2014
Intelligent Systems Group
University of the Basque Country UPV/EHU
Outline
• Previous work
• Work @ UniMan: Relational Join
  1. Motivation
  2. Algorithm
  3. Results
  4. Conclusions
About Myself
• PhD Student @ Intelligent Systems Group: 2011 – Now
• Research interest: Efficient use of Modern coprocessors
• Performance modeling
• Code acceleration
• Development of parallel implementations
• Molecular Dynamics simulation code (MSc thesis)
• Kernel Density Estimation (Under review)
• Relational Join (Work @ UniMan)
Kernel Density Estimation
• Estimate the Probability Density Function of a population
• Our use case: Climate models
• Challenge: large volumes of data
[Figure: Histogram vs. KDE]
Kernel Density Estimation
• 1st: Algorithmic rework
• 2nd: Parallel implementation for multi-/many-core processors
• Compared to R+MKL and CUDA implementations
Naive approach
  for each evaluation_point e
    for each sample s
      d = distance(e, s)
      pdf[e] += density(d)
Our approach
  B = computeBoundingBox()
  for each sample s
    b = fitBoundingBox(B, s)
    for each evaluation_point e in b
      d = distance(e, s)
      pdf[e] += density(d)
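The two loops above can be sketched in Python as follows. This is a minimal illustration, not the implementation from the talk: it assumes 1-D data, an evenly spaced evaluation grid, and an Epanechnikov kernel (chosen here because it is exactly zero outside the bandwidth `h`, so skipping points outside the bounding box loses nothing). The names `kde_naive`, `kde_bounded`, and `epanechnikov` are hypothetical.

```python
import math

def epanechnikov(u):
    """Compact-support kernel: zero whenever |u| >= 1."""
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0

def kde_naive(grid, samples, h):
    """Visit every (evaluation point, sample) pair."""
    pdf = [0.0] * len(grid)
    for i, e in enumerate(grid):
        for s in samples:
            pdf[i] += epanechnikov((e - s) / h)
    return [p / (len(samples) * h) for p in pdf]

def kde_bounded(grid, samples, h):
    """Visit only the evaluation points inside each sample's bounding box.

    Assumes `grid` is evenly spaced and sorted in ascending order.
    """
    x0 = grid[0]
    step = grid[1] - grid[0]
    pdf = [0.0] * len(grid)
    for s in samples:
        # Index range of grid points within [s - h, s + h].
        lo = max(0, math.ceil((s - h - x0) / step))
        hi = min(len(grid) - 1, math.floor((s + h - x0) / step))
        for i in range(lo, hi + 1):
            pdf[i] += epanechnikov((grid[i] - s) / h)
    return [p / (len(samples) * h) for p in pdf]
```

With a small bandwidth relative to the grid extent, the inner loop of `kde_bounded` touches only a few grid points per sample, which is where the saving over the naive version comes from.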
Work @ UniMan
Join
Slide based on: Wu, Lisa, et al. "Navigating big data with high-throughput, energy-efficient data
partitioning." Proc. of the 40th Annual International Symposium on Computer Architecture. ACM,
2013.
Do sunblock sales correlate with weather?
[Figure: Sales and Weather tables combined with Join-Date(Sales, Weather)]
Join
• Join is an everyday operation
Join
Goal: Develop a parallel implementation of relational
join targeting today's heterogeneous systems
Heterogeneous systems
• Performance depends on the nature of the application
Multi-core: 16 cores, 250 GFLOP/s
Many-core: 61 cores, 1 TFLOP/s
GPU: 2880 cores, 1.3 TFLOP/s
From complex control flow (multi-core) to number crunching (GPU)
Heterogeneous systems
• Wide variety of programming environments in HPC
• OpenMP, CUDA, MPI, TBB,…
• Our choice: OpenCL
• Write once, compile (NVIDIA SDK, Intel SDK, AMD SDK), run many
Heterogeneous systems
• Cross-platform portability != Performance portability
• OpenCL: Abstraction layer
• Solution 1: per-device hand-made tuning
• Not portable at all
• Solution 2: auto tuning
• Rely on performance models
Previous work
• Collection of performance modeling proposals for the latest
GPUs and Intel Xeon Phi
• Comprehensive analysis of the literature since ~2007
• Organized as: execution time estimation, bottleneck highlighting,
power consumption estimation, simulators
Unai Lopez-Novoa et al. "A Survey of Performance Modeling and Simulation
Techniques for Accelerator-based Computing." IEEE Transactions on
Parallel and Distributed Systems, DOI: 10.1109/TPDS.2014.2308216
Types of Join
Table A keys: 100, 103, 104
Table B keys: 100, 102
Inner: (100, 100)
Left Outer: (100, 100), (103, -), (104, -)
Right Outer: (100, 100), (-, 102)
Full Outer: (100, 100), (103, -), (104, -), (-, 102)
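The four join types can be reproduced with a short Python sketch. It is meant only to pin down the semantics on the slide's keys: the function `join` is a hypothetical helper using a quadratic nested loop, not the algorithm developed in this work, and `None` stands for the "-" shown on the slide.

```python
def join(a, b, how="inner"):
    """Return (key_a, key_b) pairs; None marks a missing side."""
    # Matching pairs are part of every join type.
    rows = [(x, y) for x in a for y in b if x == y]
    if how in ("left", "full"):
        # Keep keys of A that found no partner in B.
        rows += [(x, None) for x in a if x not in b]
    if how in ("right", "full"):
        # Keep keys of B that found no partner in A.
        rows += [(None, y) for y in b if y not in a]
    return rows

table_a = [100, 103, 104]
table_b = [100, 102]
for how in ("inner", "left", "right", "full"):
    print(how, join(table_a, table_b, how))
```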
Algorithm
• Biggest debate: Sort or Hash?
Hash-join
Procedure: 1. Hash smaller table 2. Scan larger table
Complexity: O(n + m)
Limitation: Extensive use of atomics prevents efficient parallelization
Sort-join
Procedure: 1. Sort keys 2. Scan interleaved
Complexity: O(n·log(n))
Limitation: Sorting increases complexity
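For comparison, the hash-join procedure can be sketched sequentially in Python. This is an illustrative sketch under simplifying assumptions (keys only, inner join, a `defaultdict` standing in for the hash table); `hash_join` is a hypothetical name. In a parallel version, many threads would update the same buckets during the build phase, which is where the atomics mentioned above come in.

```python
from collections import defaultdict

def hash_join(a, b):
    """Inner join on keys: hash the smaller table, then scan the larger one."""
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    buckets = defaultdict(int)
    for key in small:
        # Build phase: count occurrences of each key in the smaller table.
        # A parallel build would need an atomic increment here.
        buckets[key] += 1
    out = []
    for key in large:
        # Probe phase: emit one pair per matching occurrence.
        out.extend([(key, key)] * buckets.get(key, 0))
    return out
```

Build and probe each touch every key once, which gives the O(n + m) cost listed above (plus the size of the output).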
Algorithm
• Step 1: Sort keys in both tables
• Radix sort: speed/scalability sweet spot
Table A (unsorted): 100, 104, 103, 103, 102, 100
Table B (unsorted): 100, 102, 101, 102
Sort
Table A (sorted): 100, 100, 102, 103, 103, 104
Table B (sorted): 100, 101, 102, 102
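A radix sort over integer keys can be sketched as below. The slides do not say which variant was used; this sketch assumes a least-significant-digit (LSD) radix sort on non-negative integers, processing `bits_per_pass` bits at a time, and the name `radix_sort` is hypothetical. Each pass is a stable scatter into buckets, which is the part that parallel implementations typically map to per-thread histograms and prefix sums.

```python
def radix_sort(keys, bits_per_pass=8):
    """LSD radix sort for non-negative integers."""
    if not keys:
        return []
    mask = (1 << bits_per_pass) - 1
    shift = 0
    keys = list(keys)
    # Keep going while the largest key still has bits at or above `shift`.
    while max(keys) >> shift:
        buckets = [[] for _ in range(mask + 1)]
        for k in keys:
            # Stable scatter by the current digit.
            buckets[(k >> shift) & mask].append(k)
        keys = [k for bucket in buckets for k in bucket]
        shift += bits_per_pass
    return keys
```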
Algorithm
• Step 2: Merge
• Add non matching keys for outer joins
Table A: 100, 100, 102, 103, 103, 104
Table B: 100, 101, 102, 102
Result (Inner Join): (100, 100), (100, 100), (102, 102), (102, 102)
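The merge step on two sorted key lists can be sketched sequentially as follows. This is a minimal illustration of the technique, not the OpenCL code from the talk; `merge_join` is a hypothetical name and only the inner join is handled. For outer joins, the two branches that advance a single cursor are the places where a non-matching key would be emitted.

```python
def merge_join(a, b):
    """Inner join of two key lists that are already sorted ascending."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1          # a[i] has no partner in b
        elif a[i] > b[j]:
            j += 1          # b[j] has no partner in a
        else:
            key = a[i]
            # Find the run of equal keys on each side.
            i_end = i
            while i_end < len(a) and a[i_end] == key:
                i_end += 1
            j_end = j
            while j_end < len(b) and b[j_end] == key:
                j_end += 1
            # Every occurrence in A pairs with every occurrence in B.
            out.extend([(key, key)] * ((i_end - i) * (j_end - j)))
            i, j = i_end, j_end
    return out

table_a = [100, 100, 102, 103, 103, 104]
table_b = [100, 101, 102, 102]
print(merge_join(table_a, table_b))
```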
Implementation
• Steps:
1) Develop a naive OpenCL implementation
2) Optimize per device type
3) Add a cost model for load balancing and partitioning
• Experimental setup:
• M1: 4-core (x2 SMT) Xeon + Xeon Phi + 384-core GPU
• M2: 12-core (x2 SMT) Xeon + Xeon Phi + 2496-core GPU
• Baseline: ModernGPU (CUDA)
Results
Per-device tuning
• Optimizations:
• Thread scheduling
• Memory management
• Overheads:
• Compilation
• Memory allocation
Optimizations
• Per device thread scheduling
[Figure: OpenCL kernel threads and groups mapped to OpenCL devices: four-core CPU (cores 0 to 3) and 61-core Xeon Phi (cores 0 to 60)]
Optimizations
• Per device memory management
OpenCL Device Memory Hierarchy
Memory:   Private     Local          Global
Scope:    Thread      Thread-group   Any thread
Mapping:  Registers   On-chip        RAM
          Registers   RAM            RAM
          Registers   RAM            RAM
Overheads
• Compilation
• Online compilation: X% of runtime (without I/O)
• Memory allocation
• Intel SDK: Y% of the merge step on Xeon Phi
OpenCL program: host code and device code
Compilation: host code offline (gcc), device code online (SDK)
Results
Future work
1) Finish tuning per-device code
2) Test join on FPGA
3) Revisit partitioning strategy
4) Support multi-device execution
• Develop a cost model that characterizes Join
• Split the workload at runtime among the available devices
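One simple way to split a workload among devices is in proportion to a per-device throughput estimate. The sketch below only illustrates that idea; it is not the cost model proposed as future work. The function name `split_workload`, the device labels, and the throughput figures are all hypothetical.

```python
def split_workload(n_rows, throughput):
    """Split n_rows across devices in proportion to estimated throughput.

    `throughput` maps a device name to a relative speed (any positive unit).
    """
    total = sum(throughput.values())
    shares = {dev: int(n_rows * t / total) for dev, t in throughput.items()}
    # Rounding down may leave a few rows unassigned; give them to the
    # fastest device so the shares always add up to n_rows.
    fastest = max(throughput, key=throughput.get)
    shares[fastest] += n_rows - sum(shares.values())
    return shares

# Hypothetical relative speeds, for illustration only.
print(split_workload(1000, {"cpu": 1.0, "phi": 3.0, "gpu": 6.0}))
```

A real cost model would also have to account for transfer and compilation overheads per device, which this proportional split ignores.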
Conclusions
• Performance: device-specific code
• Performance portability:
a) Platform-specific code
b) Parameterizable code
• High OpenCL SDK dependence
• Only portable debugging tool: printf
• …but still the only portable framework
• Future: OpenACC / OpenMP 4.0 ?