Talk @ APT Group, University of Manchester, 06 August 2014
Abstract:
Today's HPC systems, such as those in the Top500 list, are equipped with a range of different processors, from multi-core CPUs to GPUs. Programming them can be a tough job, especially if we want to squeeze every last FLOP of performance out of them.
As a PhD student, I am currently on a brief research visit to the APT group, working on topics related to the programmability and efficient use of GPUs and many-core coprocessors. In particular, I am implementing a large database operation using OpenCL on these state-of-the-art systems. In this talk I will summarize my work in Manchester and discuss future work on this topic.
Harnessing OpenCL in Modern Coprocessors
1. Harnessing OpenCL in modern coprocessors
Unai Lopez-Novoa
unai.lopez@ehu.es
06 Aug 2014
Intelligent Systems Group
University of the Basque Country UPV/EHU
2. Outline
• Previous work
• Work @ UniMan: Relational Join
1. Motivation
2. Algorithm
3. Results
4. Conclusions
3. About Myself
• PhD Student @ Intelligent Systems Group: 2011 – Now
• Research interest: Efficient use of modern coprocessors
• Performance modeling
• Code acceleration
• Development of parallel implementations
• Molecular Dynamics simulation code (MSc thesis)
• Kernel Density Estimation (Under review)
• Relational Join (Work @ UniMan)
4. Kernel Density Estimation
• Estimate the Probability Density Function of a population
• Our use case: Climate models
• Challenge: large volumes of data
[Figure: histogram vs. KDE]
5. Kernel Density Estimation
• 1st: Algorithmic rework
• 2nd: Parallel implementation: multi/many core processors
• Compared to R+MKL and CUDA implementations
Naive approach
for each evaluation_point e
    for each sample s
        d = distance(e,s)
        e += density(d)
Our approach
B = computeBoundingBox()
for each sample s
    b = fitBoundingBox(B,s)
    for each e_point e in b
        d = distance(e,s)
        e += density(d)
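The two loop structures can be sketched in Python for the 1-D case. This is an illustrative sketch, not the implementation presented in the talk: it assumes a Gaussian kernel, a uniform evaluation grid, and a hypothetical cutoff of 4 bandwidths as the bounding interval. The names kde_naive, kde_bounded and cutoff are made up for this example.

```python
import math

def gaussian(d, h):
    """Gaussian kernel value at distance d for bandwidth h."""
    return math.exp(-0.5 * (d / h) ** 2) / (h * math.sqrt(2.0 * math.pi))

def kde_naive(samples, grid, h):
    """Naive approach: every evaluation point visits every sample."""
    density = [0.0] * len(grid)
    for i, e in enumerate(grid):
        for s in samples:
            density[i] += gaussian(abs(e - s), h)
    return [v / len(samples) for v in density]

def kde_bounded(samples, grid, h, cutoff=4.0):
    """Sample-centred approach: each sample only updates the grid points
    inside its bounding interval [s - cutoff*h, s + cutoff*h].
    Assumes `grid` is uniformly spaced and sorted ascending."""
    density = [0.0] * len(grid)
    x0 = grid[0]
    step = grid[1] - grid[0]
    radius = cutoff * h  # assumed truncation radius, not from the talk
    for s in samples:
        lo = max(0, math.ceil((s - radius - x0) / step))
        hi = min(len(grid) - 1, math.floor((s + radius - x0) / step))
        for i in range(lo, hi + 1):
            density[i] += gaussian(abs(grid[i] - s), h)
    return [v / len(samples) for v in density]

# Hypothetical data: five samples, grid from -2.0 to 5.0 in steps of 0.05.
samples = [0.1, 0.4, 0.45, 2.0, 2.2]
grid = [i * 0.05 for i in range(-40, 101)]
naive = kde_naive(samples, grid, h=0.2)
bounded = kde_bounded(samples, grid, h=0.2)
max_diff = max(abs(a - b) for a, b in zip(naive, bounded))
```

In this sketch the bounded version skips the kernel tails, so the two results differ only by the small contributions beyond the cutoff, while the inner loop touches far fewer evaluation points per sample.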
7. Join
Slide based on: Wu, Lisa, et al. "Navigating big data with high-throughput, energy-efficient data partitioning." Proc. of the 40th Annual International Symposium on Computer Architecture. ACM, 2013.
Do sunblock sales correlate with weather?
[Figure: Sales and Weather tables combined by Join-Date(Sales, Weather)]
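One way to read Join-Date(Sales, Weather) is as an inner equi-join on the date column. The Python sketch below uses a simple hash join (build a dictionary on one table, probe it with the other). The rows, the column layout and the function name join_on_date are made up for illustration, and the OpenCL implementation discussed in the talk may use a different strategy.

```python
from collections import defaultdict

def join_on_date(sales, weather):
    """Inner equi-join of (date, units) rows with (date, temperature) rows."""
    # Build phase: index the weather table by date.
    by_date = defaultdict(list)
    for date, temperature in weather:
        by_date[date].append(temperature)
    # Probe phase: look up each sales row and emit the matching pairs.
    joined = []
    for date, units in sales:
        for temperature in by_date.get(date, []):
            joined.append((date, units, temperature))
    return joined

# Hypothetical data, not taken from the talk.
sales = [("2014-08-01", 120), ("2014-08-02", 45), ("2014-08-04", 80)]
weather = [("2014-08-01", 31.0), ("2014-08-02", 19.5), ("2014-08-03", 24.0)]
result = join_on_date(sales, weather)
```

Only the dates present in both tables survive the join, which is the combined table one would then inspect for a correlation between sales and temperature.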
9. Join
Goal: Develop a parallel implementation of relational join targeting today's heterogeneous systems
10. Heterogeneous systems
• Performance depends on the nature of the application
• Multi-core: 16 cores, 250 GFLOP/s
• Many-core: 61 cores, 1 TFLOP/s
• GPU: 2880 cores, 1.3 TFLOP/s
[Figure: the three devices placed on an axis running from complex control flow to number crunching]
11. Heterogeneous systems
• Wide variety of programming environments in HPC
• OpenMP, CUDA, MPI, TBB,…
• Our choice: OpenCL
[Figure: Write once → Compile (NVIDIA SDK, Intel SDK, AMD SDK) → Run many]
12. Heterogeneous systems
• Cross-platform portability != Performance portability
• OpenCL: Abstraction layer
• Solution 1: per-device hand-made tuning
• Not portable at all
• Solution 2: auto tuning
• Rely on performance models
13. Previous work
• Collection of performance modeling proposals for the latest GPUs and Intel Xeon Phi
• Comprehensive analysis of the literature since ~2007
• Organized as: execution time estimation, bottleneck highlighting, power consumption estimation, simulators
Unai Lopez-Novoa et al. "A Survey of Performance Modeling and Simulation Techniques for Accelerator-based Computing." IEEE Transactions on Parallel and Distributed Systems, DOI: 10.1109/TPDS.2014.2308216
14. Types of Join
• Table A: 100, 103, 104
• Table B: 100, 102
• Inner: (100, 100)
• Left Outer: (100, 100), (103, -), (104, -)
• Right Outer: (100, 100), (-, 102)
• Full Outer: (100, 100), (103, -), (104, -), (-, 102)
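The four join types can be reproduced on the key columns from the slide (Table A = 100, 103, 104; Table B = 100, 102). This is a minimal Python sketch for single-column tables with unique keys. The function name join_keys and the use of None for the "-" placeholder are choices made here, not part of the talk.

```python
def join_keys(a, b, how="inner"):
    """Join two lists of unique keys; None marks a missing side ('-' on the slide)."""
    in_a, in_b = set(a), set(b)
    rows = []
    for key in a:
        if key in in_b:
            rows.append((key, key))    # match: kept by every join type
        elif how in ("left", "full"):
            rows.append((key, None))   # unmatched row from Table A
    if how in ("right", "full"):
        for key in b:
            if key not in in_a:
                rows.append((None, key))   # unmatched row from Table B
    return rows

table_a = [100, 103, 104]
table_b = [100, 102]
inner = join_keys(table_a, table_b, "inner")
left = join_keys(table_a, table_b, "left")
right = join_keys(table_a, table_b, "right")
full = join_keys(table_a, table_b, "full")
```

The inner join keeps only the key shared by both tables, and each outer variant adds back the unmatched rows of the side it preserves.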
25. Future work
1) Finish tuning per-device code
2) Test join on FPGA
3) Revisit partitioning strategy
4) Support multi-device execution
• Develop a cost model that characterizes Join
• Split the workload at runtime among the available devices
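Item 4 could be approached by letting a cost model estimate each device's join throughput and splitting the input proportionally. The Python sketch below only illustrates that idea: the throughput figures are placeholders, not measurements from the talk, and split_workload is a hypothetical helper.

```python
def split_workload(n_rows, throughput):
    """Split n_rows among devices proportionally to estimated throughput.

    `throughput` maps a device name to an estimated rows/second figure
    (in practice this would come from the cost model).
    """
    total = sum(throughput.values())
    shares = {dev: int(n_rows * t / total) for dev, t in throughput.items()}
    # Give any rows lost to rounding down to the fastest device.
    fastest = max(throughput, key=throughput.get)
    shares[fastest] += n_rows - sum(shares.values())
    return shares

# Placeholder estimates, assumed for illustration only.
estimated = {"cpu": 1.0e6, "xeon_phi": 2.0e6, "gpu": 5.0e6}
shares = split_workload(1_000_000, estimated)
```

A static proportional split like this is the simplest option; a runtime scheme could instead hand out smaller chunks on demand so that errors in the cost model matter less.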
26. Conclusions
• Performance: device specific code
• Performance portability:
a) Platform specific code
b) Parameterizable code
• High OpenCL SDK dependence
• Only portable debugging tool: printf
• …but still the only portable framework
• Future: OpenACC / OpenMP 4.0 ?