This document compares GPU execution time prediction using machine learning techniques with analytical modeling. It begins with introductions to parallel programming models, GPU architectures, and machine learning techniques. It then describes the testing methodology: algorithms were run on various NVIDIA GPUs, and the collected datasets were used to compare machine learning approaches, such as linear regression and random forests, against an analytical BSP-based model for GPU execution time prediction. The goal is to determine which approach predicts execution times more accurately.
Parallel algorithms for multi-source graph traversal and its applications - Subhajit Sahu
Highlighted notes on Parallel algorithms for multi-source graph traversal and its applications.
While doing research work under Prof. Kishore Kothapalli.
Seema is working on Multi-source BFS with hybrid-CSR, with applications in APSP, diameter, centrality, reachability.
BFS can be either top-down (from visited nodes, mark the next frontier) or bottom-up (from unvisited nodes, mark the next frontier). She mentioned that the hybrid approach is more efficient. EtaGraph uses unified degree cut (UDC) graph partitioning, and also overlaps data transfer with kernel execution. iCENTRAL uses biconnected components for betweenness centrality on dynamic graphs.
Hybrid CSR uses an additional value array for storing packed "has edge/neighbour" bits. This can give a better memory access pattern if many bits are set, and cause many threads to wait if many bits are zero. She mentioned that the Volta architecture has an independent PC and stack per thread (similar to a CPU?). Does it then not matter if the threads in a block diverge?
(BFS = G*v, Multi-source BFS = G*vs)
Remote lecture (in Slovene): SUPERCOMPUTING IN MARIBOR (Assoc. Prof. Dr. Aleš Zamuda), Tuesday, 21 December 2021 at 20:00, online via MS Teams.
The event report for IEEE CIS11:
https://events.vtools.ieee.org/m/295510
Accelerating economics: how GPUs can save you time and money - Laurent Oberholzer
Graphics processing units - or GPUs as they are more commonly known - are specialized circuits historically designed to efficiently handle computer graphics. They are highly parallel computers which can process large amounts of data simultaneously. The graphics algorithms for which GPUs have been designed and optimized share characteristics with other algorithms used in high-performance computing. For certain well-suited scientific applications, the GPU's infrastructure has been shown to achieve substantial speedups. For example, the evaluation of the Black-Scholes partial differential equation to price financial options has been found to be performed nearly 200 times faster in parallel on a GPU than serially on a single-core CPU (Buck 2006, “GeForce 8800 & NVIDIA CUDA: A New Architecture for Computing on the GPU”).
The main goal of this study is to illustrate how hybrid CPU/GPU systems can be used within computational economics to decrease the execution time of an implementation of a particular model. We start with a mainstream implementation of Raberto et al.'s (2001) Genoa Artificial Stock Market ("GASM"), an agent-based model which simulates a financial market in discrete time in which heterogeneous agents trade a single asset. To ensure that it is well suited for execution on the GPU, particular attention is given to the algorithm used to clear the market according to the authors' specified mechanism. Existing parallel programming interfaces - in particular the OpenACC standard and the Thrust parallel algorithms library - are then deployed in the code. We aim to show:
- how the codebase of our GASM implementation is adapted to utilize these technologies;
- how incrementally offloading work to the GPU affects the execution time of our model;
- how this speedup varies as a function of the problem size (e.g. number of agents, number of time steps, etc.), i.e. weak scaling; and
- how parameterizing the work distribution within the OpenACC programming model to increase the number of execution units used impacts this speedup, i.e. strong scaling.
This study also aims at giving the reader a working knowledge of GPU-based parallel computing, and when and how it should be used.
Examples Implementing Black-Box Discrete Optimization Benchmarking Survey for... - University of Maribor
PPSN XV: 15th International Conference on Parallel Problem Solving from Nature
Coimbra, Portugal, September 8–12, 2018
Session: Black Box Discrete Optimization Benchmarking (BB-DOB)
Saturday, 8 September, 14:00-15:30, Room 2.4
Aleš Zamuda, Goran Hrovat, Elena Lloret, Miguel Nicolau, Christine Zarges
Ateji PX for Java introduces parallelism at the language level, extending the sequential base language with a small number of parallel primitives inspired by pi-calculus. This makes parallel programming simple and intuitive, easy to learn, efficient, provably correct, and compatible with existing code, tools and development processes.
Manufacturing Execution System for Industry - I am pleased to share details about our successfully working model and how we can provide innovative, industry-proven Plant Intelligence Solutions for an automotive manufacturing plant like yours, giving you the following benefits in a real-time environment:
• Informed decisions based on Data Analytics
• Streamlined and Optimized Operations
• Improved Productivity
• Reduced Total Defects
• Reduced Inventory
• Lean, “Smart” MES approach and application coverage for low TCO
• Improved return on assets and investments (ROA/ ROI)
• Improved Equipment Up-Time
• Improved responsiveness, improved plant throughput time
• Enhanced Real Time visibility into production data
We have successfully met the expectations of many end users in the Manufacturing, Food & Beverage, Pharma, Oil & Gas, Petrochemical, Cement, Power, and Metals industries. We have more than 5000 software installations throughout India, with a proven track record in almost every industry vertical, and have delivered projects to 40+ countries on every continent, including the Americas, Europe, Asia, Africa, and Australia.
Junli Gu at AI Frontiers: Autonomous Driving Revolution - AI Frontiers
Autonomous driving has gained enormous attention and momentum in the past year due to its potentially huge impact on the car industry. Junli Gu's talk summarizes the current trends and ongoing efforts in driverless cars. Her talk highlights the technical challenges and shares some insights into how machine learning might lead the way.
This paper presents a novel data flow architecture that utilizes data from engineering simulations to generate a reduced order model within Apache Spark. The reduced order model from Spark is then utilized by an evolutionary algorithm in the optimization of an industrial system component. This work is presented in the context of the shape optimization of a heat exchanger fin and demonstrates the ability of the engineering simulation, the reduced order model and the evolutionary algorithm to exchange data with each other by utilizing Spark as the common data-processing framework. In order to enable a user to monitor the input design parameter space, self-organizing maps are generated for visualization. The results of the evolutionary optimization utilizing this data flow are compared with results from invoking high-fidelity engineering simulations. This novel data flow architecture decouples the evolutionary algorithm from the reduced order model and allows improvement of the optimization results by continuously augmenting the reduced order model with data from the evolutionary algorithm. Additionally, when constraints on the optimization algorithm are modified, the evolutionary algorithm can adapt and evolve good solutions. The methodology presented in this article also makes it feasible to simultaneously tune evolutionary optimization experiments along with engineering simulations at a relatively low computational cost.
Understanding Parallelization of Machine Learning Algorithms in Apache Spark™ - Databricks
Machine Learning (ML) is a subset of Artificial Intelligence (AI). In this talk, Richard Garris, Principal Architect at Databricks will explain how various ML algorithms are parallelized in Apache Spark. Andrew Ng calls the algorithms the "rocket ship" and the data "the fuel that you feed machine learning" to build deep learning applications. We will start with an understanding of machine learning pipelines built using single machine algorithms including Pandas, scikit-learn, and R. Then we will discuss how Apache Spark MLlib can be used to parallelize your machine learning pipeline with Linear Regression and Random Forest. Lastly, we will discuss ways to parallelize single machine algorithms in Spark by broadcasting the data and then performing distributed feature selection, model creation or hyperparameter tuning.
AVOIDING DUPLICATED COMPUTATION TO IMPROVE THE PERFORMANCE OF PFSP ON CUDA GPUS - csandit
Graphics Processing Units (GPUs) have emerged as powerful parallel compute platforms for various application domains. A GPU consists of hundreds or even thousands of processor cores and adopts the Single Instruction Multiple Threading (SIMT) architecture. Previously, we proposed an approach that optimizes the Tabu Search algorithm for solving the Permutation Flowshop Scheduling Problem (PFSP) on a GPU by using a math function to generate all different permutations, avoiding the need to place all the permutations in global memory. Based on that result, this paper proposes another approach that further improves performance by avoiding duplicated computation among threads, which is incurred when any two permutations have the same prefix. Experimental results show that the GPU implementation of our proposed Tabu Search for PFSP runs up to 1.5 times faster than another GPU implementation proposed by Czapiński and Barnes.
Great paper on HSAemu, a full-system simulator built from PQEMU to do full-system emulation of HSA, from our academic member Yeh-Ching Chung of National Tsing Hua University.
We all know good training data is crucial for data scientists to build quality machine learning models. But when productionizing Machine Learning, Metadata is equally important. Consider for example:
- Provenance of model allowing for reproducible builds
- Context to comply with GDPR, CCPA requirements
- Identifying data shift in your production data
This is the reason we built ArangoML Pipeline, a flexible Metadata store which can be used with your existing ML Pipeline.
Today we are happy to announce a release of ArangoML Pipeline Cloud. Now you can start using ArangoML Pipeline without having to even start a separate docker container.
In this webinar, we will show how to leverage ArangoML Pipeline Cloud with your Machine Learning Pipeline by using an example notebook from the TensorFlow tutorial.
Find the video here: https://www.arangodb.com/arangodb-events/arangoml-pipeline-cloud/
Is Multicore Hardware For General-Purpose Parallel Processing Broken?: Notes - Subhajit Sahu
Highlighted notes of article while studying Concurrent Data Structures, CSE:
Is Multicore Hardware For General-Purpose Parallel Processing Broken?
By Uzi Vishkin
Communications of the ACM, April 2014, Vol. 57 No. 4, Pages 35-39
10.1145/2580945
BSSML16 L8. REST API, Bindings, and Basic Workflows - BigML, Inc
Brazilian Summer School in Machine Learning 2016
Day 2 - Lecture 3: REST API, Bindings, and Basic Workflows
Lecturer: Dr. José Antonio Ortega - jao (BigML)
MLSEV Virtual. ML Platformization and AutoML in the Enterprise - BigML, Inc
Machine Learning Platformization and AutoML in the Enterprise, by Ed Fernández, Board Director at Arowana International.
This presentation focuses on the adoption of Machine Learning platforms and AutoML in the Enterprise, the challenges around DevOps and MLOps, latest market trends, future evolution and the impact of AutoML for rapid prototyping of Machine Learning models.
*MLSEV 2020: Virtual Conference.
A Comparison of GPU Execution Time Prediction using Machine Learning and Analytical Modeling
1. Introduction and Motivation · Parallel Programming Models · Machine Learning Techniques · Comparison
Ph.D.(c) CS Marcos Amarís González
Advisor: Dr. Alfredo Goldman vel Lejbman
Co-advisor: Dr. Raphael Yokoingawa de Camargo
December, 2016
(gold, amaris)@ime.usp.br (IME - USP) BSP-based model Vs. Machine Learning December, 2016 1 / 32
2. Timeline
1 Introduction and Motivation
2 Parallel Programming Models
    BSP-based Analytical Model for GPUs
3 Machine Learning Techniques
4 Comparison
    Methodology
    Results
    Conclusions and Future Work
3. BSP-based model Vs. Machine Learning
1 Introduction and Motivation
2 Parallel Programming Models
3 Machine Learning Techniques
4 Comparison
4. Games and Video Cards
80s - first video drivers.
Evolution of 3D games: it became necessary to apply textures, lights, shadows, reflections, etc.
More computing power was also necessary.
For this, video cards became more flexible and powerful.
5. Graphics Processing Units - GPUs
The term GPU was popularized by NVIDIA in 1999, which marketed the GeForce 256 as the world's first GPU.
In 2002 the first general-purpose GPU was launched. The term GPGPU was coined by Mark Harris.
The main manufacturers of GPUs are NVIDIA and AMD. In 2006 NVIDIA launched CUDA.
Today's applications include Deep Learning and Virtual Reality.
6. General Purpose GPU - GPGPU
The main program executes on the CPU (host) and is responsible for starting execution on the GPU (device).
GPUs have their own memory hierarchy, and data must be transferred over the PCI Express bus.
7. GPU Versus CPU
Nowadays GPUs can perform many computing operations much more efficiently than multicore CPUs.
8. CUDA, GPUs and Memory Spaces
A GPU has many processors P, all with the same clock rate R, divided into multiprocessors.
A CUDA kernel can be composed of thousands or even millions of threads t.

Type        On Chip   Cacheable   Instructions   Visibility   Latency
Registers   Yes       No          Load/Store     Thread       1 cycle
Shared-L1   Yes       No          Load/Store     Block        5 cycles
Constant    No        Yes         Load           Kernel       100 cycles
Texture     No        Yes         Load/Store     Kernel       100 cycles
Local       No        Yes         Load/Store     Thread       100 cycles
Cache L2    No        Yes         Load/Store     Kernel       250 cycles
Global      No        Yes         Load/Store     Kernel       500 cycles

Table: Memory types in GPUs supported by CUDA
9. Roadmap of NVIDIA GPU architectures
In modern GPUs, energy consumption is an important constraint.
GPU designs are generally highly scalable.
10. Roadmap of NVIDIA GPU architectures
Compute Capability differentiates between architectures and models of NVIDIA GPUs.
11. Compute Unified Device Architecture
CUDA - Compute Unified Device Architecture
CUDA is an extension of the C language; it allows controlling the execution of grids on a GPU and managing its memory.
12. GPU Programming Model
A GPU application is organized in grids, blocks and threads. Threads are grouped into blocks, and blocks are grouped into a grid.
A linear translation gives the ID of a thread in a grid.
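The linear translation above can be sketched on the host side; a minimal illustration (Python, with assumed parameter names mirroring CUDA's built-in variables) for a 1-D grid of 1-D blocks:

```python
def global_thread_id(block_idx: int, block_dim: int, thread_idx: int) -> int:
    # Same arithmetic as CUDA's blockIdx.x * blockDim.x + threadIdx.x
    return block_idx * block_dim + thread_idx

# Thread 5 of block 2, with 256 threads per block:
print(global_thread_id(2, 256, 5))  # -> 517
```

The same pattern extends to 2-D and 3-D grids by flattening each extra dimension in turn.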
13. Top 500 Supercomputers
Intel Core i7 990X: 6 cores, US$1000; theoretical peak performance 0.4 TFLOPS.
GTX 680: 1500 cores and 2 GB, price US$500; theoretical peak performance 3.0 TFLOPS.
Accelerators and co-processors in the Top 500 ranking of the world's most powerful supercomputers.
14. Top 500 Green Supercomputers
Ranking of the world's most energy-efficient supercomputers.
15. BSP-based model Vs. Machine Learning
1 Introduction and Motivation
2 Parallel Programming Models
3 Machine Learning Techniques
4 Comparison
16. Amdahl's Law and Flynn's Taxonomy
Flynn's Taxonomy - 1966

               Single Instruction    Multiple Instruction
Single Data    SISD - Sequential     MISD
Multiple Data  SIMD [SIMT] - GPU     MIMD - Multicore

Amdahl's Law - 1967
Amdahl's law gives the theoretical speedup of the execution of a task at fixed workload that can be expected of a system whose resources are improved.
Speedup, with Sp = speedup, p = number of processors, T1 = sequential time and Tp = parallel time on p processors:

    Sp = T1 / Tp    (1)
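The measured speedup in Eq. (1), together with Amdahl's classical bound for a parallel fraction f on p processors (a standard formula, not shown on the slide; the numeric values below are illustrative), can be sketched as:

```python
def speedup(t1: float, tp: float) -> float:
    # Eq. (1): Sp = T1 / Tp
    return t1 / tp

def amdahl_bound(f: float, p: int) -> float:
    # Amdahl's law: maximum speedup with parallel fraction f on p processors
    return 1.0 / ((1.0 - f) + f / p)

print(speedup(100.0, 25.0))   # -> 4.0
print(amdahl_bound(0.9, 8))   # about 4.7
```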
17. Parallel Random Access Machine (PRAM)
Figure: PRAM Model
It ignores lower-level architectural constraints and details, such as memory access contention and overhead, synchronization overhead, interconnection network throughput, connectivity, speed limits and link bandwidths, etc.
18. Bulk Synchronous Parallel Model
Figure: Super-step in the BSP model
The cost to execute the i-th super-step is given by:

    wi + g·hi + L    (2)

The total execution time of the application is given by:

    T = W + g·H + L·S    (3)
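A minimal sketch of the BSP cost model in Eqs. (2)-(3), with illustrative (not measured) per-super-step values:

```python
def bsp_time(w, h, g, L):
    # T = W + g*H + L*S, where W = sum(w_i) is total local work,
    # H = sum(h_i) is total communicated words, S = number of
    # super-steps, g = cost per communicated word, L = barrier cost
    W, H, S = sum(w), sum(h), len(w)
    return W + g * H + L * S

# Three super-steps with work [10, 20, 30], traffic [1, 2, 3], g = 2, L = 5:
print(bsp_time([10, 20, 30], [1, 2, 3], 2.0, 5.0))  # -> 87.0
```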
19. Bulk Synchronous Parallel Model
Bulk Synchronous Parallel (BSP) was introduced by Leslie Valiant in 1990 (Turing Award 2010).
A high-level model for parallelism.
Models the computation and communication of a kernel function.
We did not include the synchronization step, nor the communication with host memory.
Optimization aspects are modeled by adjusting a single parameter λ.
20. Analytical Model Published
Divergence, optimizations in communication and differences between architectures are adjusted by a single parameter λ.¹

    Tk = t · (Comp + CommSM + CommGM) / (R · P · λ)    (4)

    CommGM = (ld1 + st1 − L1 − L2) · gGM + L1 · gL1 + L2 · gL2    (5)

    CommSM = (ld0 + st0) · gSM    (6)

Comp, ld0, st0, ld1 and st1 are obtained from the source code.
L1 and L2 cache hits are captured by profiling.

¹ M. Amaris, D. Cordeiro, A. Goldman, and R. Y. de Camargo, "A simple BSP-based model to predict execution time in GPU applications," in 22nd Int'l Conference on HPC, December 2015.
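Equations (4)-(6) can be sketched directly; all counts, g-costs and λ below are illustrative placeholders, not profiled values:

```python
def comm_gm(ld1, st1, l1_hits, l2_hits, g_gm, g_l1, g_l2):
    # Eq. (5): global-memory communication, with L1/L2 hits served at lower cost
    return (ld1 + st1 - l1_hits - l2_hits) * g_gm + l1_hits * g_l1 + l2_hits * g_l2

def comm_sm(ld0, st0, g_sm):
    # Eq. (6): shared-memory communication
    return (ld0 + st0) * g_sm

def kernel_time(t, comp, c_sm, c_gm, R, P, lam):
    # Eq. (4): t threads, computation cost comp, clock rate R,
    # P processors, adjustment parameter lambda
    return t * (comp + c_sm + c_gm) / (R * P * lam)

c_gm = comm_gm(100, 50, 30, 20, 500, 5, 250)   # -> 55150
c_sm = comm_sm(40, 40, 5)                      # -> 400
print(kernel_time(1024, 1000, c_sm, c_gm, 1e9, 2048, 1.0))
```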
21. BSP-based model Vs. Machine Learning
1 Introduction and Motivation
2 Parallel Programming Models
3 Machine Learning Techniques
4 Comparison
22. Introduction and Motivation Parallel Programming Models Machine Learning Techniques Comparison
Machine Learning Techniques
The theoretical subject of “learning” is related to prediction.
Supervised Learning
Unsupervised Learning
Three different machine learning techniques:
Simple Linear Regression (LR)
Support Vector Machines (SVM)
Random Forest (RF)
In this work, we wanted to use simple models to show that they can achieve
reasonable predictions.
Fair comparison: all approaches used the same data input and profile information.
Linear Regression (LR)
It assumes that there is approximately a linear relationship between each Xp
and Y . Mathematically, we can write the multiple linear regression model
as
Y ≈ β0 + β1X1 + β2X2 + . . . + βpXp + ε    (7)
where Xp represents the pth predictor and βp quantifies the association
between that variable and the response.
Figure: Example of a Linear Regression
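For the one-predictor case shown in the figure, the least-squares estimates have a closed form. A minimal sketch in pure Python (illustrative code, not the authors' implementation):

```python
def fit_simple_lr(xs, ys):
    """Closed-form least-squares estimates for Y ~ beta0 + beta1 * X."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # beta1 = covariance(X, Y) / variance(X)
    beta1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    beta0 = my - beta1 * mx
    return beta0, beta1

# Perfectly linear data (y = 1 + 2x) recovers the line exactly:
b0, b1 = fit_simple_lr([1, 2, 3, 4], [3, 5, 7, 9])
```

With several predictors (as in Eq. 7) the same idea generalizes to solving the normal equations over the full design matrix.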
Support Vector Machines (SVM)
SVM belongs to the general category of kernel methods, which are algo-
rithms that depend on the data only through dot-products. The dot product
can be replaced by a kernel function which computes a dot product in some
possibly high dimensional feature space Z. It maps the input vector x into
the feature space Z.
Figure: Example of linear and non-linear kernels for SVM in classification
Figure: Example of linear and non-linear kernels for SVM in regression
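The kernel trick described above can be checked on a toy case: the polynomial kernel K(x, z) = (x · z)² equals an ordinary dot product after mapping 2-D inputs into the feature space Z = (x1², √2·x1·x2, x2²). This is a standard identity, sketched here for illustration (not code from the talk):

```python
import math

def poly_kernel(x, z):
    """K(x, z) = (x . z)^2, computed directly in the 2-D input space."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    """Explicit degree-2 feature map into the feature space Z."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

x, z = (1.0, 2.0), (3.0, 0.5)
# poly_kernel(x, z) and dot(phi(x), phi(z)) agree: the kernel acts as a
# dot product in Z without ever forming phi explicitly.
```

This is why SVMs can work in a high-dimensional (even infinite-dimensional) feature space at the cost of evaluating K in the input space.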
Random Forest (RF)
Random Forests are ensemble methods built on decision trees, capable of performing
both regression and classification tasks.
Figure: Diagram of a decision tree
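The random-forest idea for regression can be illustrated in a few lines: fit a weak tree on each bootstrap resample of the training set and average the predictions. The sketch below uses one-split "stumps" for brevity (a real random forest also subsamples features and grows deeper trees; all names are illustrative):

```python
import random

def fit_stump(data):
    """One-split regression tree: pick the threshold minimising squared error."""
    best = None
    for thr, _ in data:
        left = [y for x, y in data if x <= thr]
        right = [y for x, y in data if x > thr]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        err = (sum((y - ml) ** 2 for y in left)
               + sum((y - mr) ** 2 for y in right))
        if best is None or err < best[0]:
            best = (err, thr, ml, mr)
    if best is None:  # degenerate resample (all x equal): predict overall mean
        m = sum(y for _, y in data) / len(data)
        return lambda x: m
    _, thr, ml, mr = best
    return lambda x: ml if x <= thr else mr

def fit_forest(data, n_trees=25, seed=0):
    """Bagging: fit each stump on a bootstrap resample, average predictions."""
    rng = random.Random(seed)
    trees = [fit_stump([rng.choice(data) for _ in data]) for _ in range(n_trees)]
    return lambda x: sum(t(x) for t in trees) / len(trees)

# Two clusters: low x -> y near 11, high x -> y near 41.
data = [(1, 10), (2, 11), (3, 12), (8, 40), (9, 41), (10, 42)]
rf = fit_forest(data)
```

Averaging over many bootstrapped trees is what gives the forest lower variance than any single tree.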
BSP-based model Vs. Machine Learning
1 Introduction and Motivation
2 Parallel Programming Models
3 Machine Learning Techniques
4 Comparison
Algorithm Testbed
9 different applications
Matrix Multiplications in 4 different optimizations:
* Global Memory - MMGU
* Global Memory with coalesced accesses - MMGC
* Global and Shared Memory - MMSU
* Global and Shared Memory with coalesced accesses - MMSC
Matrix Addition in 2 different optimizations:
* Global Memory - MAU
* Global Memory with coalesced accesses - MAC
Dot Product - dotP
Vector Addition - vAdd
Maximum Subarray Problem - MSA
Dataset
Each sample was executed 10 times, with a 95% confidence interval.
First Scenario - Machine Learning vs. Machine Learning
MMSC with block sizes 4^2, 8^2, 12^2, 16^2, 20^2, 24^2, 28^2, and 32^2. 256 samples per GPU; more than 2000 samples in total.
Second Scenario - Analytical Model vs. Machine Learning
Analytical Model
1D applications with input sizes from 2^18 to 2^27. 10 per GPU; 90 samples.
2D applications with input sizes from 2^8 to 2^13. 6 per GPU; 54 samples.
Machine Learning - block sizes 8^2, 16^2, and 32^2.
1D applications with input sizes from 2^18 to 2^27. 207 per GPU; 1863 samples.
2D applications with input sizes from 2^8 to 2^13. 96 per GPU; 864 samples.
MSA with block size 128. 96 samples per GPU; 864 samples.
Features of the Machine Learning Techniques
13 features were used to feed the machine learning techniques.

Feature             Description
num of cores        Number of cores per GPU
max clock rate      GPU max clock rate
Bandwidth           Theoretical bandwidth
Input Size          Size of the problem
totalLoadGM         Load transactions in global memory
totalStoreGM        Store transactions in global memory
TotalLoadSM         Load transactions in shared memory
TotalStoreSM        Store transactions in shared memory
FLOPS SP            Floating-point operations in single precision
BlockSize           Number of threads per block
GridSize            Number of blocks in the kernel
No. threads         Number of threads in the application
Achieved Occupancy  Ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor
Use Cases of the Analytical Model
Applications: Matrix Multiplication (MMGU, MMGC, MMSU, MMSC), Matrix Addition (MAU, MAC), vAdd, dotP, MSA.
Model parameters per application:
comp: N · FMA | 1 · 24 | 1 · 96 | (N/t) · 100
ld1: 2 · N | 2 | 2 | N/t
st1: 1 | 1 | 1 | 5
ld0: 0 | 2 · N | 0 | 0 | N/t
st0: 0 | 1 | 0 | 1 + log(t) | 5
Figure: Lambda values of each one of the applications (MMGU, MMGC, MMSU, MMSC, MAU, MAC, dotP, vAdd, MSA), ranging from roughly 0 to 130.
Log transformation
We first transformed the data to a log2 scale and, after performing the
learning and predictions, we returned to the original scale using a 2^pred
transformation², reducing the non-linearity effects.
Figure: Quantile-Quantile Analysis of the generated models
² B. J. Barnes et al., "A regression-based approach to scalability prediction," in Proceedings of the 22nd Annual Int'l Conference on Supercomputing (ICS '08), New York, NY, USA.
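The transformation above can be sketched as follows: fit in log2 space, predict, then return to the original scale with 2^pred. The helper below uses simple linear regression for brevity (illustrative names; the actual work applied this around the ML models described earlier):

```python
import math

def fit_log2_lr(xs, ys):
    """Fit y ~ a + b*x after mapping both axes to log2 scale (power-law fit)."""
    lx = [math.log2(x) for x in xs]
    ly = [math.log2(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    b = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) / \
        sum((u - mx) ** 2 for u in lx)
    a = my - b * mx
    # Back-transform: 2**pred returns predictions to the original scale.
    return lambda x: 2 ** (a + b * math.log2(x))

# Data following y = x**2 is exactly linear in log2 space:
predict = fit_log2_lr([2, 4, 8, 16], [4, 16, 64, 256])
```

Execution times that grow polynomially with input size become nearly linear under this transformation, which is why it reduces the non-linearity the models have to capture.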
Results Machine Learning - 1st Scenario
Tesla K40
Tesla K20
Quadro
Titan
TitanBlack
TitanX
GTX 680
GTX 980
GTX 970
●●●
●●
●
●
●●
●
●
●
●●●●●●
●●
●●
●●●●
●
●
●●●
●
●●
0.0
0.5
1.0
1.5
2.0
2.5
AccuracyTkTm
Linear Regression of MMSC
●●
●●●●●●●●●●●●●●
●●
●
●●
●●●●●●●●●●●●●●
●
●
●
●
0.0
0.5
1.0
1.5
2.0
2.5
AccuracyTkTm
Support Vector Machines of MMSC
●
●●
●●
●
●
●
●
●
●●
●●
●
●
●
●●●
●
●
●●●●●
●●●●
●●●
●●●●
0.0
0.5
1.0
1.5
2.0
2.5
AccuracyTkTm
Random Forest of MMSC
Figure: Accuracy of Machine Learning Algorithms of matMul-SM-Coalesced with
many samples
Conclusions
Fair comparison between the analytical model and the machine learning techniques.
The analytical model requires manual calculations for each application.
Machine learning provides more flexibility and generalization.
Linear Regression can produce reasonable predictions.
However, machine learning requires many labeled samples.
Future Work
Irregular benchmarks (Rodinia, SHOC).
Multiple kernels on GPUs and global synchronization.
One extra memory level: the CPU RAM.
Feature extraction.
Thanks for your attention
Repository of the work:
https://github.com/marcosamaris/svm-gpuperf