This document discusses using machine learning techniques to perform runtime selection of CPUs or GPUs for executing Java programs. It describes the challenges of supporting Java features like exceptions on GPUs and of accelerating Java programs. Features such as loop characteristics, instruction counts, and memory accesses are extracted from programs to train an SVM model that predicts the faster device. Evaluated on 11 applications, the model achieves 97.6-99% accuracy, using 5-fold cross-validation to avoid overfitting. This runtime selection approach can adapt to new hardware without needing to rebuild performance models.
Exploration of Supervised Machine Learning Techniques for Runtime Selection o... (Akihiro Hayashi)
Fourth Workshop on Accelerator Programming Using Directives (WACCPD2017, co-located with SC17)
While multi-core CPUs and many-core GPUs are both viable platforms for parallel computing, programming models for them can impose large burdens upon programmers due to their complex and low-level APIs. Since managed languages like Java are designed to be run on multiple platforms, parallel language constructs and APIs such as Java 8 Parallel Stream APIs can enable high-level parallel programming with the promise of performance portability for mainstream (“non-ninja”) programmers. To achieve this goal, it is important for the selection of the hardware device to be automated rather than specified by the programmer, as is done in current programming models. Due to a variety of factors affecting performance, predicting a preferable device for faster performance of individual kernels remains a difficult problem. While a prior approach uses machine learning to address this challenge, there is no comparable study of good supervised machine learning algorithms and good program features to track. In this paper, we explore 1) program features to be extracted by a compiler and 2) various machine learning techniques that improve accuracy in prediction, thereby improving performance. The results show that an appropriate selection of program features and machine learning algorithm can further improve accuracy. In particular, support vector machines (SVMs), logistic regression, and the J48 decision tree are found to be reliable techniques for building accurate prediction models from just two, three, or four program features, achieving accuracies of 99.66%, 98.63%, and 98.28%, respectively, under 5-fold cross-validation.
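To make the approach concrete, here is a minimal Java sketch of the prediction step, assuming a linear SVM already trained offline; the two features, the weights, and the bias are hypothetical stand-ins, not the paper's actual model:

    // Hypothetical linear SVM decision function over two compiler-extracted
    // kernel features; weights and bias are illustrative, not a trained model.
    public class DevicePredictor {
        private static final double W_RANGE = 0.82;   // weight for log2(parallel loop range)
        private static final double W_MEM   = -1.37;  // weight for memory accesses per iteration
        private static final double BIAS    = -0.25;

        static String predict(double log2Range, double memAccessesPerIter) {
            double score = W_RANGE * log2Range + W_MEM * memAccessesPerIter + BIAS;
            // The sign of the decision function selects the predicted faster device.
            return score > 0 ? "GPU" : "CPU";
        }

        public static void main(String[] args) {
            // A large, compute-light kernel: 2^20 iterations, 3 memory accesses each.
            System.out.println(predict(20.0, 3.0));
        }
    }

At runtime, a JIT compiler would extract the feature vector from the kernel's intermediate representation and evaluate such a function before dispatching to the chosen device.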
This document proposes using unikernels and specialized machine learning compilers and runtimes to enable distributed machine learning on IoT devices. It demonstrates an end-to-end proof-of-concept for TinyML as a service that trains an MNIST model, compiles it to run on an ESP32 microcontroller, and performs inference on handwritten digits. Next steps include adding orchestration with CoAP, supporting more devices and complex models, distributed training on microcontrollers, and distributed inference across heterogeneous hardware accelerators.
For the full video of this presentation, please visit:
https://www.edge-ai-vision.com/2021/02/introduction-to-the-tvm-open-source-deep-learning-compiler-stack-a-presentation-from-octoml/
Luis Ceze, Co-founder and CEO of OctoML, a Professor in the Paul G. Allen School of Computer Science and Engineering at the University of Washington, and Venture Partner at Madrona Venture Group, presents the “Introduction to the TVM Open Source Deep Learning Compiler Stack” tutorial at the September 2020 Embedded Vision Summit.
There is an increasing need to bring machine learning to a wide diversity of hardware devices. Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Deploying workloads to new platforms — such as mobile phones, embedded devices, and accelerators — requires significant manual effort.
In this talk, Ceze presents his work on the TVM stack, which exposes graph- and operator-level optimizations to provide performance portability for deep learning workloads across diverse hardware back-ends. TVM solves optimization challenges specific to deep learning, such as high-level operator fusion, mapping to arbitrary hardware primitives and memory latency hiding. It also automates optimization of low-level programs to hardware characteristics by employing a novel, learning-based cost modeling method for rapid exploration of optimizations.
This document summarizes Kazuaki Ishizaki's keynote presentation at the Fourth International Symposium on Computing and Networking (CANDAR'16) on transparent GPU exploitation for Java. The presentation covered Ishizaki's research history developing compilers and optimizing code for GPUs. It described a Java just-in-time compiler that can generate optimized GPU code from parallel loops in Java programs without requiring programmers to manage low-level GPU operations like data transfers and memory allocation themselves. The compiler implements optimizations like array alignment, read-only caching, and reducing data copying to improve GPU performance. The goal is to make GPU programming easier and more portable across hardware for Java programmers.
Using GPUs to Handle Big Data with Java (Tim Ellison)
A copy of the slides presented at JavaOne conference 2014.
Learn how Java can exploit the power of graphics processing units (GPUs) to optimize high-performance enterprise and technical computing applications such as big data and analytics workloads. This presentation covers principles and considerations for GPU programming from Java and looks at the software stack and developer tools available. It also presents a demo showing GPU acceleration and discusses what is coming in the future.
Published on 11 May 2018
Chainer is a deep learning framework which is flexible, intuitive, and powerful.
This slide introduces some unique features of Chainer and its additional packages such as ChainerMN (distributed learning), ChainerCV (computer vision), ChainerRL (reinforcement learning), Chainer Chemistry (biology and chemistry), and ChainerUI (visualization).
Profiling PyTorch for Efficiency & Sustainability (geetachauhan)
From my talk at the Data & AI Summit: the latest update on the PyTorch Profiler and how you can use it to optimize for efficiency. The talk also dives into the future and what we need to do together as an industry to move towards sustainable AI.
These are slides from the Dec 17 SF Bay Area Julia Users meeting [1]. Ehsan Totoni presented the ParallelAccelerator Julia package, a compiler that performs aggressive analysis and optimization on top of the Julia compiler. Ehsan is a Research Scientist at Intel Labs working on the High Performance Scripting project.
[1] http://www.meetup.com/Bay-Area-Julia-Users/events/226531171/
High Performance Analytics Toolkit (HPAT) is a Julia-based framework for big data analytics on clusters that is both easy to use and extremely fast; it is orders of magnitude faster than alternatives like Apache Spark.
HPAT automatically parallelizes analytics tasks written in Julia and generates efficient MPI/C++ code.
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs (Jeff Larkin)
This talk was presented at the DOE Centers of Excellence Performance Portability Workshop in August 2017. In this talk I explore the current status of 4 OpenMP 4.5 compilers for NVIDIA GPUs and CPUs from the perspective of performance portability between compilers and between the GPU and CPU.
Using GPUs to handle Big Data with Java by Adam Roberts (J On The Beach)
Modern graphics processing units (GPUs) are efficient general-purpose stream processors. Learn how Java can exploit the power of GPUs to optimize high-performance enterprise and technical computing applications such as big data and analytics workloads. This presentation covers principles and considerations for GPU programming from Java and looks at the software stack and developer tools available. It also presents a demo showing GPU acceleration and discusses what is coming in the future.
1. The document discusses GPUs and their advantages for machine learning tasks like deep learning and parallel computing. GPUs have many parallel processors that can accelerate matrix multiplications and other computations used in machine learning algorithms.
2. It introduces CUDA and how it allows GPUs to be programmed for general purpose processing through a parallel computing model. Examples are given of how matrix multiplications and convolutional neural network operations can be parallelized on GPUs.
3. H2O is presented as a machine learning platform that supports GPU acceleration for algorithms like gradient boosted machines, enabling faster training on large datasets. Instructions are provided on getting started with CUDA, cuDNN and using GPUs for machine learning.
GTC16 - S6510 - Targeting GPUs with OpenMP 4.5 (Jeff Larkin)
These slides are from an instructor-led tutorial at GTC16. The talk discusses using a pre-release version of Clang with support for OpenMP offloading directives to NVIDIA GPUs to experiment with OpenMP 4.5 target directives.
This document provides an overview and agenda for a tutorial on deep learning implementations and frameworks. The tutorial is split into two sessions: the first covers the basics of neural networks, common design aspects of neural network implementations, and differences between deep learning frameworks; the second offers coding examples in the different frameworks and a conclusion. Slide decks and resources are provided for each topic. The tutorial aims to introduce the fundamentals of deep learning and compare popular frameworks.
Despite the growing number of deep learning practitioners and researchers, many of them do not use GPUs, which can lead to long training/evaluation cycles and impractical research.
In his talk, Lior shares how to get started with GPUs and some of the best practices that helped him during research and work. The talk is for everyone who works with machine learning (deep learning experience is NOT mandatory!). It covers the very basics of how a GPU works, CUDA drivers, IDE configuration, training, inference, and multi-GPU training.
GPUIterator: Bridging the Gap between Chapel and GPU Platforms (Akihiro Hayashi)
The ACM SIGPLAN 6th Annual Chapel Implementers and Users Workshop (CHIUW2019) co-located with PLDI 2019 / ACM FCRC 2019.
PGAS (Partitioned Global Address Space) programming models were originally designed to facilitate productive parallel programming at both the intra-node and inter-node levels in homogeneous parallel machines. However, there is a growing need to support accelerators, especially GPU accelerators, in heterogeneous nodes in a cluster. Among high-level PGAS programming languages, Chapel is well suited for this task due to its use of locales and domains to help abstract away low-level details of data and compute mappings for different compute nodes, as well as for different processing units (CPU vs. GPU) within a node. In this paper, we address some of the key limitations of past approaches to mapping Chapel onto GPUs as follows. First, we introduce a Chapel module, GPUIterator, which is a portable programming interface that supports GPU execution of a Chapel forall loop. This module makes it possible for Chapel programmers to easily use hand-tuned native GPU programs/libraries, which is an important requirement in practice since there is still a big performance gap between compiler-generated GPU code and hand-tuned GPU code; hand-optimization of CPU-GPU data transfers is also an important contributor to this performance gap. Second, though Chapel programs are regularly executed on multi-node clusters, past work on GPU enablement of Chapel programs mainly focused on single-node execution. In contrast, our work supports execution across multiple CPU+GPU nodes by accepting Chapel's distributed domains. Third, our approach supports hybrid execution of a Chapel parallel (forall) loop across both GPU and CPU cores, which is beneficial for specific platforms. Our preliminary performance evaluations show that the use of the GPUIterator is a promising approach for Chapel programmers to easily utilize a single or multiple CPU+GPU node(s) while maintaining portability.
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors (Intel® Software)
The second-generation Intel® Xeon Phi™ processor offers new and enhanced features that provide significant performance gains in modernized code. For this lab, we pair these features with Intel® Software Development Products and methodologies to enable developers to gain insights on application behavior and to find opportunities to optimize parallelism, memory, and vectorization features.
Profiling deep learning network using NVIDIA Nsight Systems (Jack (Jaegeun) Han)
Jack Han presented on profiling deep learning networks using NVIDIA tools. He discussed annotating PyTorch models with NVTX to identify bottlenecks, optimizing data loading in PyTorch, and achieving a 4x speedup on BERT by using mixed precision and Tensor Cores. He also covered profiling TensorFlow graphs with NVTX plugins and command examples for profiling multi-GPU applications with Nsight Systems.
Easy and High Performance GPU Programming for Java Programmers (Kazuaki Ishizaki)
IBM researchers presented techniques for executing Java programs on GPUs using IBM Java 8. Developers can write parallel programs using standard Java 8 stream APIs without annotations. The IBM Java runtime optimizes the programs for GPU execution by exploiting read-only caches, reducing data transfers between CPU and GPU, and eliminating redundant exception checks. Benchmark results showed the GPU version was 58.9x faster than single-threaded CPU code and 3.7x faster than 160-threaded CPU code on average, achieving good performance gains.
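The programming model itself is plain Java; the sketch below (array sizes are made up for illustration) shows the kind of parallel stream loop such a JIT can consider for GPU offload, and it runs unchanged on any JVM:

    import java.util.Arrays;
    import java.util.stream.IntStream;

    public class VectorMultiply {
        public static void main(String[] args) {
            int n = 1 << 20;
            float[] a = new float[n], b = new float[n], c = new float[n];
            Arrays.fill(b, 2.0f);
            Arrays.fill(c, 3.0f);
            // A parallel loop in the standard Stream API: no annotations, no CUDA.
            // A GPU-enabled JIT may offload the lambda; other JVMs use CPU cores.
            IntStream.range(0, n).parallel().forEach(i -> a[i] = b[i] * c[i]);
            System.out.println(a[0]); // 6.0
        }
    }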
Hardware Acceleration of SVM Training for Real-time Embedded Systems: An Over... (Ilham Amezzane)
Support Vector Machines (SVMs) have proven to yield high accuracy and have been widely used in recent years. However, the standard versions of the SVM algorithm are very time-consuming and computationally intensive, which challenges engineers to explore hardware architectures other than the CPU that are capable of performing real-time training and classification while maintaining low power consumption in embedded systems. This paper presents an overview of works based on the two most popular parallel processing devices, the GPU and the FPGA, with a focus on the multiclass training process. Since the various techniques have been evaluated using different experimentation platforms and methodologies, we focus only on the improvements realized in each study.
This talk was given at GTC16 by James Beyer and Jeff Larkin, both members of the OpenACC and OpenMP committees. It's intended to be an unbiased discussion of the differences between the two languages and the tradeoffs to each approach.
ChainerUI v0.3 was released with new features like sampled log visualization and performance tuning. It also introduced the experimental ImageReport extension for visualizing images generated during training. Examples shown include using ImageReport with a DCGAN and pix2pix model to display generated images. Future work includes improving the usability of ImageReport, adding support for charts, logging improvements, and enhancing the user experience of ChainerUI.
Backend.AI Technical Introduction (19.09 / 2019 Autumn) (Lablup Inc.)
This slide introduces technical specs and details about Backend.AI 19.09.
* On-premise clustering / container orchestration / scaling on cloud
* Container-level fractional GPU technology to use one GPU as many GPUs on many containers at the same time.
* NVIDIA GPU Cloud integrations
* Enterprise features
A brief introduction to the problems and prospects of OpenCL and distributed heterogeneous computation with Hadoop. For Big Data Dive 2013 (Belarus Java User Group).
The document evaluates the performance of parallelizing a naive matrix multiplication algorithm using OpenMP on a multicore processor. It implements sequential and parallel versions of square matrix multiplication and measures execution time. Results show that as the number of threads increases, execution time decreases while speedup and efficiency increase, demonstrating that the parallel implementation outperforms the serial version, especially for larger matrices.
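That study uses OpenMP; as a rough Java analogue of the same sequential-versus-parallel comparison (matrix size and timing harness are arbitrary here), parallelizing only the outer loop mirrors an OpenMP parallel-for:

    import java.util.Arrays;
    import java.util.stream.IntStream;

    public class MatMulBench {
        static void multiply(double[][] a, double[][] b, double[][] c, boolean parallel) {
            int n = a.length;
            IntStream rows = IntStream.range(0, n);
            if (parallel) rows = rows.parallel(); // like "#pragma omp parallel for" on the outer loop
            rows.forEach(i -> {
                for (int j = 0; j < n; j++) {
                    double sum = 0.0;
                    for (int k = 0; k < n; k++) sum += a[i][k] * b[k][j];
                    c[i][j] = sum;
                }
            });
        }

        public static void main(String[] args) {
            int n = 512;
            double[][] a = new double[n][n], b = new double[n][n], c = new double[n][n];
            for (double[] row : a) Arrays.fill(row, 1.0);
            for (double[] row : b) Arrays.fill(row, 2.0);
            for (boolean par : new boolean[]{false, true}) {
                long t0 = System.nanoTime();
                multiply(a, b, c, par);
                System.out.printf("%s: %.1f ms%n", par ? "parallel" : "sequential",
                        (System.nanoTime() - t0) / 1e6);
            }
        }
    }

Each worker writes a disjoint set of rows of c, so the parallel version needs no synchronization, which is exactly why the outer loop is the natural one to parallelize.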
Application Optimisation using OpenPOWER and Power 9 systems (Ganesan Narayanasamy)
This document discusses various ways to accelerate applications using GPUs and CUDA programming. It provides examples of using libraries, programming languages like OpenACC and CUDA, and tools like Nsight to add GPU acceleration. It also highlights many success stories and how applications from fields like HPC, deep learning, and computational chemistry have achieved speedups using these techniques. Resources and compilers are available to help developers get started with GPU programming.
This document summarizes key features introduced in Java SE 5.0 (Tiger) including generics, autoboxing/unboxing, enhanced for loops, type-safe enums, varargs, static imports, and annotations. It also discusses performance enhancements in the virtual machine as well as new concurrency utilities like Executors and ScheduledExecutorService that make multi-threaded programming easier and more robust.
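For a flavor of several of those features together, here is a small self-contained sketch (class name and values invented for illustration) combining generics, autoboxing, the enhanced for loop, and a ScheduledExecutorService:

    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class TigerDemo {
        public static void main(String[] args) throws InterruptedException {
            // Generics + autoboxing: int literals are boxed into Integer elements.
            List<Integer> xs = Arrays.asList(1, 2, 3);
            int sum = 0;
            for (int x : xs) {  // enhanced for loop, with auto-unboxing
                sum += x;
            }
            System.out.println("sum = " + sum);

            // Run a delayed task without hand-rolled Thread/sleep logic.
            ScheduledExecutorService pool = Executors.newScheduledThreadPool(1);
            pool.schedule(() -> System.out.println("ran after 100 ms"),
                          100, TimeUnit.MILLISECONDS);
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.SECONDS);
        }
    }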
JavaOne 2016: Code Generation with JavaCompiler for Fun, Speed and Business P... (Juan Cruz Nores)
On-the-fly bytecode generation is generally known to be super efficient, but also super difficult to implement and debug. Instead of trying to generate bytecode for the JVM, you can leverage the built-in Java compiler: generate Java code as a string, compile that to bytecode, and then have that executed. This gives you better code efficiency, is easier to implement, and is straightforward to debug. We'll cover on-the-fly code generation, execution, and debugging, working with HotSpot and G1 using dynamic code, as well as how to optimize for engineer implementation time: maximum gain in minimum time. We'll use practical examples and code snippets, so you can be ready to make the core processing for your business 10x faster.
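A minimal sketch of the technique described above (the generated class Hot and its method are invented for illustration): write source as a string, compile it with the built-in javax.tools.JavaCompiler, then load and invoke the result:

    import java.net.URL;
    import java.net.URLClassLoader;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import javax.tools.JavaCompiler;
    import javax.tools.ToolProvider;

    public class CompileOnTheFly {
        public static void main(String[] args) throws Exception {
            // 1. Generate Java source as a plain string.
            String src = "public class Hot { public static int twice(int x) { return x * 2; } }";
            Path dir = Files.createTempDirectory("gen");
            Files.writeString(dir.resolve("Hot.java"), src);

            // 2. Compile with the built-in compiler (requires a JDK, not a bare JRE).
            JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
            if (javac.run(null, null, null, dir.resolve("Hot.java").toString()) != 0)
                throw new IllegalStateException("compilation failed");

            // 3. Load the freshly compiled class and call it reflectively.
            try (URLClassLoader loader = new URLClassLoader(new URL[]{dir.toUri().toURL()})) {
                Class<?> hot = Class.forName("Hot", true, loader);
                System.out.println(hot.getMethod("twice", int.class).invoke(null, 21)); // 42
            }
        }
    }

Debugging works naturally because there is real source text behind every generated class, unlike raw bytecode generation.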
Introduction of Chainer, a framework for neural networks, v1.11. Slides used for the student seminar on July 20, 2016, at Sugiyama-Sato lab in the Univ. of Tokyo.
This document discusses optimizations for high performance and energy efficient implementations of the Smith-Waterman algorithm on FPGAs using OpenCL. It describes an architecture with a systolic array for parallel computation along anti-diagonals and compression techniques to address the memory-bound nature. Experimental results on two FPGA boards show up to 42.5 GCUPS performance with the best performance/power ratio compared to CPUs and other FPGA implementations.
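For context, the core Smith-Waterman recurrence (written here with a simple linear gap penalty g; the paper's exact scoring scheme may differ) is

    H_{i,j} = \max\bigl(0,\; H_{i-1,j-1} + s(a_i, b_j),\; H_{i-1,j} - g,\; H_{i,j-1} - g\bigr)

Each cell depends only on its north, west, and north-west neighbors, so all cells on one anti-diagonal (constant i + j) are independent; this is exactly the parallelism a systolic array exploits by computing an entire anti-diagonal per step.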
Hybrid CPU GPU MATLAB Image Processing Benchmarking (Dimitris Vayenas)
An attempt to quantify the substantial performance improvement observed on Windows 8.1 (NVIDIA GTX 780M / Intel HD 4600) with the latest NVIDIA driver (326.01); it may help other users, particularly of the MATLAB Image Processing and Parallel Computing Toolboxes, to consider upgrading.
Monte Carlo simulation is well-suited for GPU acceleration due to its highly parallel nature. GPUs provide lower cost and higher performance than CPUs for Monte Carlo applications. Numerical libraries for GPUs allow developers to focus on their models rather than reimplementing basic components. NAG has developed GPU libraries including random number generators and is working with financial institutions to apply Monte Carlo simulations to problems in finance.
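To see why Monte Carlo maps so well to parallel hardware, consider this small Java sketch (unrelated to NAG's libraries) that estimates pi; every sample is independent, so the work splits across processors with no coordination:

    import java.util.concurrent.ThreadLocalRandom;
    import java.util.stream.LongStream;

    public class MonteCarloPi {
        public static void main(String[] args) {
            long n = 10_000_000L;
            // Each trial is independent: draw a random point in the unit square
            // and test whether it falls inside the quarter circle.
            long hits = LongStream.range(0, n).parallel().filter(i -> {
                ThreadLocalRandom rng = ThreadLocalRandom.current();
                double x = rng.nextDouble(), y = rng.nextDouble();
                return x * x + y * y <= 1.0;
            }).count();
            System.out.printf("pi ~= %.5f%n", 4.0 * hits / n);
        }
    }

The same embarrassing parallelism is why a GPU, with thousands of threads and a good parallel random number generator, outruns a CPU on such workloads.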
The document summarizes the author's participation report at the IEEE CloudCom 2014 conference. Some key points include:
- The author attended sessions on virtualization and HPC on cloud.
- Presentations had a strong academic focus and many presenters were Asian.
- Eight papers on HPC on cloud covered topics like reliability, energy efficiency, performance metrics, and applications like Monte Carlo simulations.
This document proposes Direct Code Execution (DCE), an approach to improve the functional realism of network simulators while maintaining timing realism and debuggability. DCE runs real network application and kernel code within a simulator by virtualizing it as multiple nodes in a single process. This allows experiments to be fully reproducible while debugging distributed implementations. The approach is evaluated using case studies on multipath TCP and wireless handoffs that demonstrate improved realism over emulation and the ability to debug distributed applications.
The document discusses optimizing parallelism in NumPy-based programs. It provides examples of optimizing a main function from 50.1 ms to 2.83 ms using profiling and optimization. It discusses approaches for performant numerical code including vectorization and Python compilers. It also covers issues with oversubscription when using all CPU cores and parallel APIs in NumPy, SciPy, and scikit-learn. The document provides recommendations for tuning default parallel behavior and controlling parallelism in packages.
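The talk targets NumPy, but the oversubscription problem has a direct Java analogue: parallel streams share one common ForkJoinPool whose size can be capped, much like limiting a BLAS thread count. A minimal sketch (the value 4 is arbitrary):

    import java.util.concurrent.ForkJoinPool;

    public class PoolSize {
        public static void main(String[] args) {
            // Must run before the common pool is first used; otherwise the
            // default (number of cores minus one) applies.
            System.setProperty(
                "java.util.concurrent.ForkJoinPool.common.parallelism", "4");
            System.out.println(ForkJoinPool.commonPool().getParallelism()); // 4
        }
    }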
C++ open positions and popularity remain high, as the media has recently noted, and there is a reason for that: among the many languages and platforms available to developers today, C++ offers uncontested power and performance, allowing innovation outside the box (just think of action games, natural user interfaces, or augmented reality, to mention a few). In this talk you'll see the new features and technologies coming with Visual C++ vNext, helping you build compelling applications with a renewed developer experience. Don't miss it!
IJERA (International Journal of Engineering Research and Applications) is an international online, ... peer-reviewed journal. For more details or to submit your article, please visit www.ijera.com
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni... (MLconf)
Fast, Cheap and Deep – Scaling Machine Learning: Distributed high throughput machine learning is both a challenge and a key enabling technology. Using a Parameter Server template we are able to distribute algorithms efficiently over multiple GPUs and in the cloud. This allows us to design very fast recommender systems, factorization machines, classifiers, and deep networks. This degree of scalability allows us to tackle computationally expensive problems efficiently, yielding excellent results e.g. in visual question answering.
Matrix Multiplication with Ateji PX for Java (Patrick Viry)
Matrix multiplication is a standard benchmark for evaluating the performance of intensive data-parallel operations on recent multi-core processors. This whitepaper shows how to use Ateji PX for Java to achieve state-of-the-art parallel performance by adding a single operator to your existing code.
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS (Databricks)
Abstract: We will introduce RAPIDS, a suite of open source libraries for GPU-accelerated data science, and illustrate how it operates seamlessly with MLflow to enable reproducible training, model storage, and deployment. We will walk through a baseline example that incorporates MLflow locally, with a simple SQLite backend, and briefly introduce how the same workflow can be deployed in the context of GPU enabled Kubernetes clusters.
The lecture discusses manycore GPU architectures and programming using OpenMP and HOMP. It introduces OpenMP directives for offloading computation to accelerators and covers data mapping between the host and device. It also discusses HOMP for automated distribution of parallel loops and data across multiple accelerators to improve load balancing and performance. The document provides examples of using OpenMP target directives and data mapping for problems like AXPY and Jacobi iteration on a GPU. It evaluates performance of different loop scheduling algorithms in HOMP on a system with CPUs, GPUs and MICs.
This document presents CNNECST, an automated framework for hardware acceleration of convolutional neural networks (CNNs) on FPGAs. The framework bridges machine learning frameworks and FPGA design through high-level APIs, an intermediate representation, and C++ libraries. It was tested on two datasets and FPGAs, achieving speedups of 45x over CPUs and higher energy efficiency, while maintaining accuracy. Challenges include supporting more layer types and reduced precision data.
Similar to Machine-learning based performance heuristics for Runtime CPU/GPU Selection in Java
Chapel-on-X: Exploring Tasking Runtimes for PGAS Languages (Akihiro Hayashi)
With the shift to exascale computer systems, the importance of productive programming models for distributed systems is increasing. Partitioned Global Address Space (PGAS) programming models aim to reduce the complexity of writing distributed-memory parallel programs by introducing global operations on distributed arrays, distributed task parallelism, directed synchronization, and mutual exclusion. However, a key challenge in the application of PGAS programming models is the improvement of compilers and runtime systems. In particular, one open question is how runtime systems meet the requirement of exascale systems, where a large number of asynchronous tasks are executed.
While there are various tasking runtimes such as Qthreads, OCR, and HClib, there is no existing comparative study on PGAS tasking/threading runtime systems. To explore runtime systems for PGAS programming languages, we have implemented OCR-based and HClib-based Chapel runtimes and evaluated them with an initial focus on tasking and synchronization implementations. The results show that our OCR- and HClib-based implementations can improve the performance of PGAS programs compared to the existing Qthreads backend of Chapel.
Polyhedral compilation uses the polyhedral model to represent programs as systems of affine inequalities over iteration variables. This allows loop transformations like fusion, distribution, skewing and reversal to be expressed as affine mappings on the iteration space. The key aspects are representing the iteration domain, scheduling functions that determine the execution order of statements, and memory accesses in terms of iteration vectors. Loop transformations are specified by changing the scheduling functions to map iterations to new logical execution times while preserving semantics. This enables optimizations at the level of whole programs or subprograms.
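As a minimal illustration (the loop nest is invented for this example), the nest `for (i = 0; i < N; i++) for (j = 0; j <= i; j++) S(i,j);` has the iteration domain

    \mathcal{D}_S = \{\, (i,j) \in \mathbb{Z}^2 \mid 0 \le i < N,\ 0 \le j \le i \,\}

and an affine schedule such as \theta_S(i,j) = (j,i) assigns each iteration a new logical execution time, expressing loop interchange: iterations now run in lexicographic order of (j,i), while the domain, and hence the computation performed, stays unchanged.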
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator... (Akihiro Hayashi)
Third Workshop on Accelerator Programming Using Directives (WACCPD2016, co-located with SC16)
While GPUs are increasingly popular for high-performance computing, optimizing the performance of GPU programs is a time-consuming and non-trivial process in general. This complexity stems from the low abstraction level of standard GPU programming models such as CUDA and OpenCL: programmers are required to orchestrate low-level operations in order to exploit the full capability of GPUs. In terms of software productivity and portability, a more attractive approach would be to facilitate GPU programming by providing high-level abstractions for expressing parallel algorithms.
OpenMP is a directive-based shared memory parallel programming model and has been widely used for many years. From OpenMP 4.0 onwards, GPU platforms are supported by extending OpenMP's high-level parallel abstractions with accelerator programming. This extension allows programmers to write GPU programs in standard C/C++ or Fortran, without exposing too many details of GPU architectures.
However, such high-level parallel programming strategies generally impose additional program optimizations on compilers, which could result in lower performance than fully hand-tuned code written with low-level programming models. To study potential performance improvements from compiling and optimizing high-level GPU programs, in this paper we 1) evaluate a set of OpenMP 4.x benchmarks on an IBM POWER8 and NVIDIA Tesla GPU platform and 2) conduct a comparative performance analysis of hand-written CUDA programs and GPU programs automatically generated by the IBM XL and clang/LLVM compilers.
LLVM-based Communication Optimizations for PGAS Programs (Akihiro Hayashi)
The Second Workshop on the LLVM Compiler Infrastructure in HPC (Co-located with SC15)
While Partitioned Global Address Space (PGAS) programming languages such as UPC/UPC++, CAF, Chapel and X10 provide high-level programming models for facilitating large-scale distributed-memory parallel programming, it is widely recognized that compiler analysis and optimization for these languages has been very limited, unlike the optimization of SMP models such as OpenMP. One reason for this limitation is that current optimizers for PGAS programs are specialized to different languages. This is unfortunate since communication optimization is an important class of compiler optimizations for PGAS programs running on distributed-memory platforms, and these optimizations need to be performed more widely. Thus, a more effective approach would be to build a language-independent and runtime-independent compiler framework for optimizing PGAS programs so that new communication optimizations can be leveraged by different languages. To address this need, we introduce an LLVM-based (Low Level Virtual Machine) communication optimization framework. Our compilation system leverages existing optimization passes and introduces new PGAS language-aware runtime dependent/independent passes to reduce communication overheads. Our experimental results show an average performance improvement of 3.5× and 3.4× on 64 nodes of a Cray XC30 supercomputer and 32 nodes of a Westmere cluster respectively, for a set of benchmarks written in the Chapel language. Overall, we show that our new LLVM-based compiler optimization framework can effectively improve the performance of PGAS programs.
Studies on Automatic Parallelization for Heterogeneous and Homogeneous Multi... (Akihiro Hayashi)
This document discusses research on automatic parallelization for heterogeneous and homogeneous multicore processors. It presents Akihiro Hayashi's PhD defense at Waseda University on this topic. It motivates the need for automatic parallelization due to difficulties in programming multicore processors. It proposes a solution called OSCAR that uses a heterogeneous multicore compiler with APIs to enable automatic parallelization across different processor types. The methodology involves hint directives, parallelization of tasks, power reduction techniques, and generation of executables. It evaluates the approach on media applications using a Renesas multicore processor.
LLVM Optimizations for PGAS Programs - Case Study: LLVM Wide Optimization in C... (Akihiro Hayashi)
This document summarizes research on LLVM optimizations for PGAS (Partitioned Global Address Space) programs like Chapel. It discusses generating LLVM IR from Chapel to enable optimizations like LICM (Loop Invariant Code Motion). Evaluations show LLVM optimizations remove many communication operations and improve performance for some applications vs. C code generation. However, LLVM constraints and wide pointer overhead hurt performance for other applications. Future work includes more applications, possibly-remote to definitely-local transformations, and parallel intermediate representations in LLVM.
Speculative Execution of Parallel Programs with Precise Exception Semantics ... (Akihiro Hayashi)
Akihiro Hayashi, Max Grossman, Jisheng Zhao, Jun Shirako, Vivek Sarkar. The 26th International Workshop on Languages and Compilers for Parallel Computing (LCPC2013), September 25-27, 2013 Qualcomm Research Silicon Valley, Santa Clara, CA (co-located with CnC-2013).
Accelerating Habanero-Java Program with OpenCL Generation (Akihiro Hayashi)
Accelerating Habanero-Java Program with OpenCL Generation. Akihiro Hayashi, Max Grossman, Jisheng Zhao, Jun Shirako, Vivek Sarkar. 10th International Conference on the Principles and Practice of Programming in Java (PPPJ), September 2013.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices desire to take full advantage of the features available on those devices, but many of those features provide convenience and capability while sacrificing security. This best practices guide outlines steps users can take to better protect their personal devices and information.
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to market, combined with traditionally slow and manual security checks, has created gaps in continuous security, an important piece of the software supply chain. Today, organizations feel more susceptible to external and internal cyber threats due to the vast attack surface of their application supply chains and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for technology and making things work, along with a knack for helping others understand how things work. He brings around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
A tale of scale & speed: How the US Navy is enabling software delivery from l... (sonjaschweigert1)
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATOs (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
Pushing the limits of ePRTC: 100ns holdover for 100 days (Adtran)
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
UiPath Test Automation using UiPath Test Suite series, part 5 (DianaGray10)
Welcome to part 5 of the UiPath Test Automation using UiPath Test Suite series. In this session, we will cover CI/CD with DevOps.
Topics covered:
CI/CD within UiPath
End-to-end overview of a CI/CD pipeline with Azure DevOps
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor... (SOFTTECHHUB)
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
20 Comprehensive Checklist of Designing and Developing a Website (Pixlogix Infotech)
Dive into the world of Website Designing and Developing with Pixlogix! Looking to create a stunning online presence? Look no further! Our comprehensive checklist covers everything you need to know to craft a website that stands out. From user-friendly design to seamless functionality, we've got you covered. Don't miss out on this invaluable resource! Check out our checklist now at Pixlogix and start your journey towards a captivating online presence today.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024 (Neo4j)
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Enhancing adoption of Open Source Libraries. A case study on Albumentations.AI (Vladimir Iglovikov, Ph.D.)
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
TrustArc Webinar - 2024 Global Privacy Survey (TrustArc)
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
Removing Uninteresting Bytes in Software Fuzzing (Aftab Hussain)
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux tools -- Libxml's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Climate Impact of Software Testing at Nordic Testing Days (Kari Kakkonen)
My slides from Nordic Testing Days, 6 June 2024.
The talk discusses the climate impact and sustainability of software testing. ICT and testing must carry their part of the global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Sustainability can be added to the quality characteristics and then measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
4. Explicit Parallelism with Java
High-level parallel programming with Java offers opportunities for:
- preserving portability: low-level parallel/accelerator programming is not required
- enabling the compiler to perform parallel-aware optimizations and code generation
[Diagram: Java 8 programs written against the Java 8 Parallel Stream API (SW) target multi-core CPUs, many-core GPUs, and FPGAs (HW).]
A small example of this programming style follows below.
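To make this concrete, here is a minimal, self-contained vector-addition kernel written against the Java 8 Parallel Stream API. The array names and the problem size are illustrative, not taken from the slides.

    import java.util.stream.IntStream;

    public class VecAddExample {
        public static void main(String[] args) {
            final int n = 4_000_000;
            final double[] a = new double[n], b = new double[n], c = new double[n];
            for (int i = 0; i < n; i++) { b[i] = i; c[i] = 2.0 * i; }

            // The only parallelism construct is the standard API call below;
            // no device-specific code is written by the programmer, so a
            // parallel-aware runtime is free to choose where the loop runs.
            IntStream.range(0, n).parallel().forEach(i -> a[i] = b[i] + c[i]);

            System.out.println("a[n-1] = " + a[n - 1]);
        }
    }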
5. Challenges for GPU Code Generation
IntStream.range(0, N).parallel().forEach(i -> <lambda>);   (standard Java API call for parallelism)
- Challenge 1: Supporting Java Features on GPUs (exception semantics, virtual method calls)
- Challenge 2: Accelerating Java Programs on GPUs (kernel optimizations, data transfer (DT) optimizations)
- Challenge 3: Performance Heuristics (selection of a faster device from CPUs and GPUs: CPU vs. GPU)
Credits : Checklist by Phil Laver from the Noun Project, Rocket by Luis Prado from the Noun Project, Rocket by Maxwell Painter Karpylev from the Noun Project, Running by Dillon Arloff from the Noun Project
6. JIT Compilation for GPU
IBM Java 8 Compiler: built on top of the production version of the IBM Java 8 runtime environment.
[Diagram: method A is interpreted on the JVM for its 1st, 2nd, ..., Nth invocations; at the (N+1)th invocation, native code is generated for multi-core CPUs and many-core GPUs.]
A sketch of this invocation-counting idea follows below.
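The following is a minimal sketch of invocation-count-triggered compilation as described above. The THRESHOLD value and the compileToNativeCode helper are hypothetical placeholders, not IBM's actual JIT internals.

    // Sketch of invocation-count-triggered JIT compilation.
    class MethodProfile {
        private static final int THRESHOLD = 1000; // placeholder value for N
        private int invocationCount = 0;
        private boolean compiled = false;

        void onInvocation() {
            if (!compiled && ++invocationCount > THRESHOLD) {
                compileToNativeCode(); // generate native code for CPUs/GPUs
                compiled = true;
            }
            // Until compilation happens, the method body is interpreted on the JVM.
        }

        private void compileToNativeCode() { /* hand off to the JIT back end */ }
    }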
7. The JIT Compilation Flow
[Diagram: Java bytecode is translated to our IR, parallel streams are identified, and existing optimizations are applied. In our JIT compiler, the GPU path performs optimization for GPUs and emits NVVM IR, which libNVVM (NVIDIA's tool chain) compiles into a GPU binary; the CPU path performs target machine code generation to produce a CPU binary. The GPU runtime coordinates execution of the two binaries.]
8. Java Features on GPUs
Exceptions
- Explicit exception checking on GPUs: ArrayIndexOutOfBoundsException, NullPointerException, ArithmeticException (division by zero only)
Further optimizations
- Loop versioning for redundant exception-check elimination on GPUs, based on [Artigas+, ICS'00]
Virtual method calls
- Direct de-virtualization or guarded de-virtualization [Ishizaki+, OOPSLA'00]
Challenge 1 : Supporting Java Features on GPUs
9. Loop Versioning Example
The parallel loop:

    IntStream.range(0, N)
             .parallel()
             .forEach(i -> {
                 Array[i] = i / N; // may cause NullPointerException,
                                   // ArrayIndexOutOfBoundsException, or
                                   // ArithmeticException (division by zero)
             });

The compiler versions the loop behind a speculative check on the CPU:

    // Speculative checking on CPU
    if (Array != null            // guards against NullPointerException
        && 0 <= Array.length
        && N <= Array.length     // guards against ArrayIndexOutOfBoundsException
        && N != 0) {             // guards against ArithmeticException
        // Safe GPU execution, with the redundant per-access exception checks eliminated
    } else {
        // Fall back to the version with full exception checking
    }

Challenge 1 : Supporting Java Features on GPUs
10. Optimizing GPU Execution
Kernel optimizations
- Optimizing alignment of Java arrays on GPUs
- Utilizing the "read-only cache" in recent GPUs
Data transfer optimizations
- Redundant data transfer elimination
- Partial transfer if an array subscript is affine
Challenge 2 : Accelerating Java programs on GPUs
11. The Alignment Optimization Example
[Diagram: memory addresses 0, 128, 256, 384. With the naive alignment strategy, the object header sits at address 0, so a[0]-a[31] straddles a 128-byte boundary and loading a[0:31] takes two memory transactions. With our alignment strategy, a[0]-a[31], a[32]-a[63], and a[64]-a[95] each start on a 128-byte boundary, so loading a[0:31] takes one memory transaction.]
Challenge 2 : Accelerating Java programs on GPUs
14. Related Work : Linear Regression
Regression-based cost estimation is specific to an application.
[Plots: kernel execution time (msec) vs. the dynamic number of IR instructions (0 to 8e+07) on the NVIDIA Tesla K40 GPU and the IBM POWER8 CPU, for App 1) BlackScholes and App 2) Vector Addition.]
[1] Leung et al. Automatic Parallelization for Graphics Processing Units (PPPJ'09)
[2] Kerr et al. Modeling GPU-CPU Workloads and Systems (GPGPU-3)
Challenge 3 : Performance Heuristics
15. Is an Accurate Cost Model Required?
Constructing an accurate cost model would be excessive: considerable effort would be needed to update performance models for future generations of hardware.
Our answer to "which one is faster, the multi-core CPU or the many-core GPU?": machine-learning-based performance heuristics.
Challenge 3 : Performance Heuristics
16. ML-based Performance Heuristics
A binary prediction model is constructed by supervised machine learning with support vector machines (SVMs).
[Diagram: during training runs with the JIT compiler, the Java runtime executes App A (data 1), App A (data 2), and App B (data 3) on the CPU and GPU, and the JIT compiler performs feature extraction on each app's bytecode, yielding feature vectors 1-3; during offline model construction, the feature vectors are fed to LIBSVM to build the prediction model.]
A sketch of the LIBSVM step follows below.
Challenge 3 : Performance Heuristics
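For readers unfamiliar with LIBSVM, this is a minimal sketch of how an offline model-construction step of this shape could look using LIBSVM's Java bindings. The feature matrix, labels, and hyperparameter values are placeholders, not the paper's actual training setup.

    import libsvm.*;

    // Offline model construction with LIBSVM's Java API.
    public class OfflineModelConstruction {
        static svm_model buildModel(double[][] features, double[] labels) {
            svm_problem prob = new svm_problem();
            prob.l = features.length;          // number of training samples
            prob.y = labels;                   // e.g., +1 = CPU faster, -1 = GPU faster
            prob.x = new svm_node[prob.l][];
            for (int i = 0; i < prob.l; i++) {
                prob.x[i] = new svm_node[features[i].length];
                for (int j = 0; j < features[i].length; j++) {
                    svm_node node = new svm_node();
                    node.index = j + 1;        // LIBSVM feature indices are 1-based
                    node.value = features[i][j];
                    prob.x[i][j] = node;
                }
            }
            svm_parameter param = new svm_parameter();
            param.svm_type = svm_parameter.C_SVC;   // binary classification
            param.kernel_type = svm_parameter.RBF;
            param.C = 1.0;                          // placeholder hyperparameters
            param.gamma = 0.5;
            param.cache_size = 100;
            param.eps = 1e-3;
            return svm.svm_train(prob, param);
        }
    }

At prediction time, svm.svm_predict(model, featureVector) returns the predicted class label for a new kernel's feature vector.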
17. Features of Program
- Loop range (parallel loop size)
- The dynamic number of instructions in the IR:
  - Memory accesses
  - Arithmetic operations
  - Math methods
  - Branch instructions
  - Other instructions
A sketch of these features as a per-kernel record follows below; the feature list continues on the next slide.
Challenge 3 : Performance Heuristics
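One way to picture these features is as a per-kernel record filled in by the compiler. This container and its field names are illustrative only, not the compiler's actual data structures.

    // Illustrative container for the per-kernel features listed above.
    final class KernelFeatures {
        long loopRange;        // parallel loop size
        // Dynamic instruction counts in the IR:
        long memoryAccesses;
        long arithmeticOps;
        long mathMethodCalls;
        long branches;
        long otherInsns;
    }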
18. Features of Program (Cont'd)
- The dynamic number of array accesses:
  - Coalesced access (a[i])
  - Offset access (a[i+c])
  - Stride access (a[c*i])
  - Other access (a[b[i]])
- Data transfer size:
  - Host-to-device (H2D) transfer size
  - Device-to-host (D2H) transfer size
[Plot: bandwidth (GB/s) vs. c, the offset or stride size (0-30), for offset access array[i+c] and stride access array[c*i].]
A small kernel illustrating the four access categories follows below.
Challenge 3 : Performance Heuristics
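The four array-access categories can be illustrated with a small kernel. The arrays, sizes, and the constant c below are hypothetical examples, sized so each subscript form stays in bounds.

    import java.util.stream.IntStream;

    // Illustrates the four array-access categories counted as features.
    public class AccessPatterns {
        public static void main(String[] args) {
            final int n = 1024, c = 4;
            final double[] a = new double[c * n + c];  // large enough for a[c*i] and a[i+c]
            final int[] b = new int[n];                // index array for indirect access
            final double[] out = new double[n];

            IntStream.range(0, n).parallel().forEach(i -> {
                double coalesced = a[i];        // coalesced access: a[i]
                double offset    = a[i + c];    // offset access:    a[i+c]
                double stride    = a[c * i];    // stride access:    a[c*i]
                double other     = a[b[i]];     // other access:     a[b[i]] (indirect)
                out[i] = coalesced + offset + stride + other;
            });
            System.out.println(out[0]);
        }
    }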
20. Prediction Model Construction
- Obtained 291 samples by running 11 applications with different data sets.
- The choice is either the GPU or 160 worker threads on the CPU.
[Diagram: the same training flow as slide 16: feature extraction from the bytecode of App A (data 1, data 2) and App B (data 3) during training runs with the JIT compiler in the Java runtime, followed by offline model construction with LIBSVM.]
Challenge 3 : Performance Heuristics
21. Applications
Application           | Source    | Field               | Max Size        | Data Type
BlackScholes          |           | Finance             | 4,194,304       | double
Crypt                 | JGF       | Cryptography        | Size C (N=50M)  | byte
SpMM                  | JGF       | Numerical Computing | Size C (N=500K) | double
MRIQ                  | Parboil   | Medical             | Large (64^3)    | float
Gemm                  | Polybench | Numerical Computing | 2K x 2K         | int
Gesummv               | Polybench | Numerical Computing | 2K x 2K         | int
Doitgen               | Polybench | Numerical Computing | 256x256x256     | int
Jacobi-1D             | Polybench | Numerical Computing | N=4M, T=1       | int
Matrix Multiplication |           |                     | 2K x 2K         | double
Matrix Transpose      |           |                     | 2K x 2K         | double
VecAdd                |           |                     | 4M              | double
Challenge 3 : Performance Heuristics
22. Platform
CPU: IBM POWER8 @ 3.69GHz
- 20 cores
- 8 SMT threads per core = up to 160 threads
- 256 GB of RAM
GPU: NVIDIA Tesla K40m
- 12 GB of global memory
Challenge 3 : Performance Heuristics
24. How to Evaluate Prediction Models: 5-Fold Cross-Validation
Overfitting problem: the prediction model may be tailored to the eleven applications if the training data equals the testing data.
To avoid the overfitting problem, we calculate the accuracy of the prediction model using 5-fold cross-validation: the 291 training samples are divided into subsets 1-5, and each subset is used for testing exactly once (yielding accuracies X%, Y%, ...) while a prediction model is trained on the other four (e.g., test on subset 1, train on subsets 2-5; then test on subset 2, train on subsets 1 and 3-5; and so on). A sketch of this procedure follows below.
Challenge 3 : Performance Heuristics
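The following is a minimal sketch of the 5-fold procedure. Sample, Model, trainModel, and the trivial model returned here are hypothetical stand-ins for the LIBSVM-based pipeline, not the paper's implementation.

    import java.util.*;

    // 5-fold cross-validation: each subset is held out once for testing
    // while the other four folds are used for training.
    public class CrossValidation {
        interface Model { boolean predictCpuFaster(double[] features); }
        record Sample(double[] features, boolean cpuFaster) {}

        static double fiveFold(List<Sample> samples, Random rng) {
            final int k = 5;
            List<Sample> shuffled = new ArrayList<>(samples);
            Collections.shuffle(shuffled, rng);
            double totalAccuracy = 0.0;
            for (int fold = 0; fold < k; fold++) {
                List<Sample> train = new ArrayList<>(), test = new ArrayList<>();
                for (int i = 0; i < shuffled.size(); i++) {
                    (i % k == fold ? test : train).add(shuffled.get(i));
                }
                Model m = trainModel(train);          // hypothetical helper
                totalAccuracy += accuracy(m, test);
            }
            return totalAccuracy / k;                 // mean accuracy over 5 folds
        }

        static Model trainModel(List<Sample> train) { /* e.g., call LIBSVM */ return f -> true; }

        static double accuracy(Model m, List<Sample> test) {
            long correct = test.stream()
                    .filter(s -> m.predictCpuFaster(s.features()) == s.cpuFaster())
                    .count();
            return (double) correct / test.size();
        }
    }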
25. Accuracies with Cross-Validation
Total number of samples = 291; higher is better.
Feature set                                                      | Accuracy
Range                                                            | 79.0%
+=nIRs (adds # of IR instructions, insn types not distinguished) | 97.6%
+=dIRs (insn types distinguished)                                | 99.0%
+=Array                                                          | 99.0%
ALL (+=DT, data transfer size)                                   | 97.2%
Challenge 3 : Performance Heuristics
26. Discussion
Based on the results with 291 samples, the feature set (range, # of detailed instructions, # of array accesses) shows the best accuracy. DT does not contribute to improving the accuracy, since there are only a few samples where the DT features affect the decision.
Pros and cons:
(+) Future generations of hardware can be supported easily by re-running the applications.
(+) Simply add the new training data and rebuild the prediction model.
(-) Collecting training data takes time.
Challenge 3 : Performance Heuristics
27. Related Work: Java + GPU
              | Lang  | JIT | GPU Kernel          | Device Selection
JCUDA         | Java  | -   | CUDA                | GPU only
Lime          | Lime  | ✔   | Override map/reduce | Static
Firepile      | Scala | ✔   | reduce              | Static
JaBEE         | Java  | ✔   | Override run        | GPU only
Aparapi       | Java  | ✔   | map                 | Static
Hadoop-CL     | Java  | ✔   | Override map/reduce | Static
RootBeer      | Java  | ✔   | Override run        | Not described
HJ-OpenCL     | HJ    | -   | forall / lambda     | Static
PPPJ09 (auto) | Java  | ✔   | For-loop            | Dynamic with Regression
Our Work      | Java  | ✔   | Parallel Stream     | Dynamic with Machine Learning
None of these approaches considers Java 8 Parallel Stream APIs together with dynamic device selection based on machine learning.
28. Summary
IntStream.range(0, N).parallel().forEach(i -> <lambda>);   (standard Java API call for parallelism)
- Challenge 1: Supporting Java Features on GPUs (exception semantics, virtual method calls) -- exceptions & virtual calls supported
- Challenge 2: Accelerating Java Programs on GPUs (kernel optimizations, DT optimizations) -- over 1,000x speedup
- Challenge 3: Performance Heuristics (selection of a faster device from CPUs and GPUs: CPU vs. GPU) -- up to 99% accuracy
Credits : Checklist by Phil Laver from the Noun Project, Rocket by Luis Prado from the Noun Project, Rocket by Maxwell Painter Karpylev from the Noun Project, Running by Dillon Arloff from the Noun Project
29. Acknowledgements
Special thanks to:
- IBM CAS
- Marcel Mitran (IBM Canada)
- Jimmy Kwa (IBM Canada)
- Habanero Group at Rice
- Yoichi Matsuyama (CMU)
30. Further Readings
Code generation and optimizations:
- "Compiling and Optimizing Java 8 Programs for GPU Execution." Kazuaki Ishizaki, Akihiro Hayashi, Gita Koblents, Vivek Sarkar. 24th International Conference on Parallel Architectures and Compilation Techniques (PACT), October 2015.
Performance heuristics:
- "Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection." Akihiro Hayashi, Kazuaki Ishizaki, Gita Koblents, Vivek Sarkar. 12th International Conference on the Principles and Practice of Programming on the Java Platform: Virtual Machines, Languages, and Tools (PPPJ), September 2015.
33. How to Evaluate Prediction Models: TRUE/FALSE POSITIVE/NEGATIVE
Four types of binary prediction results (let POSITIVE be CPU and NEGATIVE be GPU):
- TRUE POSITIVE: correctly predicted that the CPU is faster
- TRUE NEGATIVE: correctly predicted that the GPU is faster
- FALSE POSITIVE: predicted the CPU is faster, but the GPU is actually faster
- FALSE NEGATIVE: predicted the GPU is faster, but the CPU is actually faster
34. How to Evaluate Prediction Models: Accuracy, Precision, and Recall Metrics
Accuracy: the percentage of selections predicted correctly:
- Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision X (X = CPU or GPU): how precise the model is when it predicts that X is faster.
- Precision CPU = TP / (TP + FP): # of samples correctly predicted CPU is faster / total # of samples predicted that CPU is faster
- Precision GPU = TN / (TN + FN)
Recall X (X = CPU or GPU): how often the prediction is correct when X is actually faster.
- Recall CPU = TP / (TP + FN): # of samples correctly predicted CPU is faster / total # of samples labeled that CPU is actually faster
- Recall GPU = TN / (TN + FP)
A worked numeric check of these formulas follows below.
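As a quick worked check of the formulas above, this sketch computes all five metrics from confusion-matrix counts. The counts themselves are made-up placeholders, not the paper's measured results.

    // Computes accuracy, precision, and recall from confusion-matrix counts
    // (POSITIVE = CPU faster, NEGATIVE = GPU faster).
    public class PredictionMetrics {
        public static void main(String[] args) {
            int tp = 100, tn = 80, fp = 5, fn = 10; // placeholder counts

            double accuracy     = (double) (tp + tn) / (tp + tn + fp + fn);
            double precisionCpu = (double) tp / (tp + fp); // predicted CPU, and correct
            double precisionGpu = (double) tn / (tn + fn); // predicted GPU, and correct
            double recallCpu    = (double) tp / (tp + fn); // CPU actually faster, caught
            double recallGpu    = (double) tn / (tn + fp); // GPU actually faster, caught

            System.out.printf(
                "accuracy=%.3f precisionCPU=%.3f recallCPU=%.3f precisionGPU=%.3f recallGPU=%.3f%n",
                accuracy, precisionCpu, recallCpu, precisionGpu, recallGpu);
        }
    }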
35. Precisions and Recalls with Cross-Validation
Higher is better.
Feature set | Precision CPU | Recall CPU | Precision GPU | Recall GPU
Range       | 79.0%         | 100%       | 0%            | 0%
+=nIRs      | 97.8%         | 99.1%      | 96.5%         | 91.8%
+=dIRs      | 98.7%         | 100%       | 100%          | 95.0%
+=Arrays    | 98.7%         | 100%       | 100%          | 95.0%
ALL         | 96.7%         | 100%       | 100%          | 86.9%
All prediction models except Range rarely make a bad decision.