Exploration of Supervised Machine Learning Techniques for Runtime Selection o... - Akihiro Hayashi
Fourth Workshop on Accelerator Programming Using Directives (WACCPD2017, co-located with SC17)
While multi-core CPUs and many-core GPUs are both viable platforms for parallel computing, programming models for them can impose large burdens upon programmers due to their complex and low-level APIs. Since managed languages like Java are designed to be run on multiple platforms, parallel language constructs and APIs such as Java 8 Parallel Stream APIs can enable high-level parallel programming with the promise of performance portability for mainstream (“non-ninja”) programmers. To achieve this goal, it is important for the selection of the hardware device to be automated rather than be specified by the programmer, as is done in current programming models. Due to a variety of factors affecting performance, predicting a preferable device for faster performance of individual kernels remains a difficult problem. While a prior approach uses machine learning to address this challenge, there is no comparable study on good supervised machine learning algorithms and good program features to track. In this paper, we explore 1) program features to be extracted by a compiler and 2) various machine learning techniques that improve accuracy in prediction, thereby improving performance. The results show that an appropriate selection of program features and machine learning algorithm can further improve accuracy. In particular, support vector machines (SVMs), logistic regression, and J48 decision tree are found to be reliable techniques for building accurate prediction models from just two, three, or four program features, achieving accuracies of 99.66%, 98.63%, and 98.28% respectively from 5-fold cross-validation.
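To make the classifier comparison concrete, here is a minimal sketch (not the authors' code) of evaluating SVM, logistic regression, and decision-tree models with 5-fold cross-validation on compiler-extracted program features; the feature set and synthetic labels below are illustrative assumptions.

```python
# Hedged sketch: comparing classifiers for CPU-vs-GPU device selection
# from program features. Data and labels are synthetic placeholders.
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical feature matrix: one row per kernel, columns such as
# parallel-loop range size and memory-access counts.
rng = np.random.default_rng(0)
X = rng.random((200, 4))          # 4 program features per kernel
y = (X[:, 0] > 0.5).astype(int)   # 1 = GPU faster, 0 = CPU faster (toy label)

for name, clf in [("SVM", SVC()),
                  ("LogReg", LogisticRegression()),
                  ("DecisionTree", DecisionTreeClassifier())]:
    scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross-validation
    print(f"{name}: mean accuracy = {scores.mean():.4f}")
```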
This document proposes using unikernels and specialized machine learning compilers and runtimes to enable distributed machine learning on IoT devices. It demonstrates an end-to-end proof-of-concept for TinyML as a service that trains an MNIST model, compiles it to run on an ESP32 microcontroller, and performs inference on handwritten digits. Next steps include adding orchestration with CoAP, supporting more devices and complex models, distributed training on microcontrollers, and distributed inference across heterogeneous hardware accelerators.
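As a rough illustration of the train-then-compile flow (the PoC itself may use different tooling), the sketch below trains a tiny MNIST model in Keras and converts it to a TensorFlow Lite flat buffer of the kind typically embedded on microcontrollers such as the ESP32:

```python
# Hedged sketch, assuming Keras + TFLite; the original PoC may use a
# different ML compiler/runtime for the ESP32.
import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = (x_train / 255.0).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(x_train, y_train, epochs=1)

# Convert to a flat buffer small enough to embed on a microcontroller.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
open("mnist.tflite", "wb").write(converter.convert())
```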
For the full video of this presentation, please visit:
https://www.edge-ai-vision.com/2021/02/introduction-to-the-tvm-open-source-deep-learning-compiler-stack-a-presentation-from-octoml/
Luis Ceze, Co-founder and CEO of OctoML, a Professor in the Paul G. Allen School of Computer Science and Engineering at the University of Washington, and Venture Partner at Madrona Venture Group, presents the “Introduction to the TVM Open Source Deep Learning Compiler Stack” tutorial at the September 2020 Embedded Vision Summit.
There is an increasing need to bring machine learning to a wide diversity of hardware devices. Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Deploying workloads to new platforms — such as mobile phones, embedded devices, and accelerators — requires significant manual effort.
In this talk, Ceze presents his work on the TVM stack, which exposes graph- and operator-level optimizations to provide performance portability for deep learning workloads across diverse hardware back-ends. TVM solves optimization challenges specific to deep learning, such as high-level operator fusion, mapping to arbitrary hardware primitives and memory latency hiding. It also automates optimization of low-level programs to hardware characteristics by employing a novel, learning-based cost modeling method for rapid exploration of optimizations.
This document summarizes Kazuaki Ishizaki's keynote presentation at the Fourth International Symposium on Computing and Networking (CANDAR'16) on transparent GPU exploitation for Java. The presentation covered Ishizaki's research history developing compilers and optimizing code for GPUs. It described a Java just-in-time compiler that can generate optimized GPU code from parallel loops in Java programs without requiring programmers to manage low-level GPU operations like data transfers and memory allocation themselves. The compiler implements optimizations like array alignment, read-only caching, and reducing data copying to improve GPU performance. The goal is to make GPU programming easier and more portable across hardware for Java programmers.
Using GPUs to Handle Big Data with Java - Tim Ellison
A copy of the slides presented at JavaOne conference 2014.
Learn how Java can exploit the power of graphics processing units (GPUs) to optimize high-performance enterprise and technical computing applications such as big data and analytics workloads. This presentation covers principles and considerations for GPU programming from Java and looks at the software stack and developer tools available. It also presents a demo showing GPU acceleration and discusses what is coming in the future.
Published on 11 May 2018
Chainer is a deep learning framework which is flexible, intuitive, and powerful.
This slide introduces some unique features of Chainer and its additional packages such as ChainerMN (distributed learning), ChainerCV (computer vision), ChainerRL (reinforcement learning), Chainer Chemistry (biology and chemistry), and ChainerUI (visualization).
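For readers new to Chainer, a minimal define-by-run model looks like the sketch below; this is a generic example, not taken from the slides.

```python
# Minimal define-by-run network in Chainer (illustrative sketch).
# L.Linear(None, ...) infers the input size lazily on first use.
import chainer
import chainer.functions as F
import chainer.links as L

class MLP(chainer.Chain):
    def __init__(self, n_hidden=100, n_out=10):
        super().__init__()
        with self.init_scope():
            self.l1 = L.Linear(None, n_hidden)
            self.l2 = L.Linear(n_hidden, n_out)

    def __call__(self, x):
        return self.l2(F.relu(self.l1(x)))

model = MLP()
```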
Profiling PyTorch for Efficiency & Sustainability - geetachauhan
From my talk at the Data & AI summit - latest update on the PyTorch Profiler and how you can use it for optimizations for efficiency. Talk also dives into the future and what we need to do together as an industry to move towards Sustainable AI
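For reference, the core profiling API discussed in the talk can be used roughly as follows; this is a minimal sketch, and the model and input are placeholders.

```python
# Hedged sketch of the PyTorch Profiler API.
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(512, 512)
x = torch.randn(32, 512)

# Add ProfilerActivity.CUDA to the list when profiling GPU work.
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    model(x)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```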
These are slides from the Dec 17 SF Bay Area Julia Users meeting [1]. Ehsan Totoni presented the ParallelAccelerator Julia package, a compiler that performs aggressive analysis and optimization on top of the Julia compiler. Ehsan is a Research Scientist at Intel Labs working on the High Performance Scripting project.
[1] http://www.meetup.com/Bay-Area-Julia-Users/events/226531171/
High Performance Analytics Toolkit (HPAT) is a Julia-based framework for big data analytics on clusters that is both easy to use and extremely fast; it is orders of magnitude faster than alternatives like Apache Spark.
HPAT automatically parallelizes analytics tasks written in Julia and generates efficient MPI/C++ code.
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs - Jeff Larkin
This talk was presented at the DOE Centers of Excellence Performance Portability Workshop in August 2017. In this talk I explore the current status of 4 OpenMP 4.5 compilers for NVIDIA GPUs and CPUs from the perspective of performance portability between compilers and between the GPU and CPU.
This document provides an overview and agenda for a tutorial on deep learning implementations and frameworks. The tutorial is split into two sessions: the first covers the basics of neural networks, common design aspects of neural network implementations, and differences between deep learning frameworks; the second includes coding examples in different frameworks and a conclusion. Slide decks and resources are provided for each topic. The tutorial aims to introduce the fundamentals of deep learning and compare popular frameworks.
Using GPUs to handle Big Data with Java by Adam Roberts - J On The Beach
Modern graphics processing units (GPUs) are efficient general-purpose stream processors. Learn how Java can exploit the power of GPUs to optimize high-performance enterprise and technical computing applications such as big data and analytics workloads. This presentation covers principles and considerations for GPU programming from Java and looks at the software stack and developer tools available. It also presents a demo showing GPU acceleration and discusses what is coming in the future.
GTC16 - S6510 - Targeting GPUs with OpenMP 4.5 - Jeff Larkin
These slides are from an instructor-led tutorial at GTC16. The talk discusses using a pre-release version of Clang with support for OpenMP offloading directives to NVIDIA GPUs to experiment with OpenMP 4.5 target directives.
1. The document discusses GPUs and their advantages for machine learning tasks like deep learning and parallel computing. GPUs have many parallel processors that can accelerate matrix multiplications and other computations used in machine learning algorithms.
2. It introduces CUDA and how it allows GPUs to be programmed for general purpose processing through a parallel computing model. Examples are given of how matrix multiplications and convolutional neural network operations can be parallelized on GPUs.
3. H2O is presented as a machine learning platform that supports GPU acceleration for algorithms like gradient boosted machines, enabling faster training on large datasets. Instructions are provided on getting started with CUDA, cuDNN and using GPUs for machine learning.
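As a small illustration of point 2 (GPU-parallel matrix multiplication), the following hedged sketch uses CuPy, a NumPy-compatible GPU array library; it is not taken from the document and assumes CUDA and CuPy are installed:

```python
# Hedged sketch: matrix multiplication on the GPU's many parallel cores.
import numpy as np
import cupy as cp

a = cp.random.rand(2048, 2048).astype(cp.float32)
b = cp.random.rand(2048, 2048).astype(cp.float32)
c = cp.matmul(a, b)       # executed on the GPU
result = cp.asnumpy(c)    # copy the result back to host memory
print(result.shape)
```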
Despite the growing number of deep learning practitioners and researchers, many of them do not use GPUs, which can lead to long training/evaluation cycles and impractical research.
In his talk, Lior shares how to get started with GPUs and some of the best practices that helped him during research and work. The talk is for everyone who works with machine learning (deep learning experience is NOT mandatory!). It covers the very basics of how GPUs work, CUDA drivers, IDE configuration, training, inference, and multi-GPU training.
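In that getting-started spirit, a minimal PyTorch pattern for device selection, inference, and simple multi-GPU use looks like this (an illustrative sketch, not the speaker's code):

```python
# Hedged sketch: pick GPU when available, move model and data there,
# and fall back to simple data parallelism when several GPUs exist.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(128, 10)
if torch.cuda.device_count() > 1:
    model = torch.nn.DataParallel(model)  # simple multi-GPU data parallelism
model = model.to(device)

x = torch.randn(64, 128, device=device)
with torch.no_grad():                     # inference path
    y = model(x)
print(y.shape, device)
```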
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors - Intel® Software
The second-generation Intel® Xeon Phi™ processor offers new and enhanced features that provide significant performance gains in modernized code. For this lab, we pair these features with Intel® Software Development Products and methodologies to enable developers to gain insights on application behavior and to find opportunities to optimize parallelism, memory, and vectorization features.
Easy and High Performance GPU Programming for Java Programmers - Kazuaki Ishizaki
IBM researchers presented techniques for executing Java programs on GPUs using IBM Java 8. Developers can write parallel programs using standard Java 8 stream APIs without annotations. The IBM Java runtime optimizes the programs for GPU execution by exploiting read-only caches, reducing data transfers between CPU and GPU, and eliminating redundant exception checks. Benchmark results showed the GPU version was, on average, 58.9x faster than single-threaded CPU code and 3.7x faster than 160-threaded CPU code.
Profiling deep learning networks using NVIDIA Nsight Systems - Jack (Jaegeun) Han
Jack Han presented on profiling deep learning networks using NVIDIA tools. He discussed annotating PyTorch models with NVTX to identify bottlenecks, optimizing data loading in PyTorch, and achieving a 4x speedup on BERT by using mixed precision and Tensor Cores. He also covered profiling TensorFlow graphs with NVTX plugins and command examples for profiling multi-GPU applications with Nsight Systems.
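A hedged sketch of the NVTX annotation pattern described (names and shapes are placeholders, and a CUDA-capable GPU is assumed); the script would then be profiled from the shell with something like `nsys profile -o report python train_step.py`:

```python
# Hedged sketch: NVTX ranges make named regions visible on the
# Nsight Systems timeline.
import torch

model = torch.nn.Linear(256, 256).cuda()
x = torch.randn(32, 256, device="cuda")

torch.cuda.nvtx.range_push("forward")  # open a named range
y = model(x)
torch.cuda.nvtx.range_pop()            # close it
torch.cuda.synchronize()               # flush async GPU work before exit
```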
GPUIterator: Bridging the Gap between Chapel and GPU Platforms - Akihiro Hayashi
The ACM SIGPLAN 6th Annual Chapel Implementers and Users Workshop (CHIUW2019) co-located with PLDI 2019 / ACM FCRC 2019.
PGAS (Partitioned Global Address Space) programming models were originally designed to facilitate productive parallel programming at both the intra-node and inter-node levels in homogeneous parallel machines. However, there is a growing need to support accelerators, especially GPU accelerators, in heterogeneous nodes in a cluster. Among high-level PGAS programming languages, Chapel is well suited for this task due to its use of locales and domains to help abstract away low-level details of data and compute mappings for different compute nodes, as well as for different processing units (CPU vs. GPU) within a node. In this paper, we address some of the key limitations of past approaches on mapping Chapel onto GPUs as follows. First, we introduce a Chapel module, GPUIterator, which is a portable programming interface that supports GPU execution of a Chapel forall loop. This module makes it possible for Chapel programmers to easily use hand-tuned native GPU programs/libraries, which is an important requirement in practice since there is still a big performance gap between compiler-generated GPU code and hand-tuned GPU code; hand-optimization of CPU-GPU data transfers is also an important contributor to this performance gap. Second, though Chapel programs are regularly executed on multi-node clusters, past work on GPU enablement of Chapel programs mainly focused on single-node execution. In contrast, our work supports execution across multiple CPU+GPU nodes by accepting Chapel's distributed domains. Third, our approach supports hybrid execution of a Chapel parallel (forall) loop across both GPU and CPU cores, which is beneficial for specific platforms. Our preliminary performance evaluations show that the use of the GPUIterator is a promising approach for Chapel programmers to easily utilize a single or multiple CPU+GPU node(s) while maintaining portability.
This talk was given at GTC16 by James Beyer and Jeff Larkin, both members of the OpenACC and OpenMP committees. It's intended to be an unbiased discussion of the differences between the two languages and the tradeoffs of each approach.
Hardware Acceleration of SVM Training for Real-time Embedded Systems: An Over... - Ilham Amezzane
Support Vector Machines (SVMs) have proven to yield high accuracy and have been widely used in recent years. However, the standard versions of the SVM algorithm are very time-consuming and computationally intensive, which challenges engineers to explore hardware architectures beyond the CPU that are capable of performing real-time training and classification while maintaining low power consumption in embedded systems. This paper provides an overview of works based on the two most popular parallel processing devices, the GPU and the FPGA, with a focus on the multiclass training process. Since different techniques have been evaluated using different experimentation platforms and methodologies, we only focus on the improvements realized in each study.
ChainerUI v0.3 was released with new features like sampled log visualization and performance tuning. It also introduced the experimental ImageReport extension for visualizing images generated during training. Examples shown include using ImageReport with a DCGAN and pix2pix model to display generated images. Future work includes improving the usability of ImageReport, adding support for charts, logging improvements, and enhancing the user experience of ChainerUI.
A brief introduction to the problem and perspectives of OpenCL and distributed heterogeneous computation with Hadoop. For Big Data Dive 2013 (Belarus Java User Group).
E&P organizations are turning more attention to accumulated data to enhance operating efficiency, safety, and recovery. The computing paradigm is shifting, the O&G paradigm is shifting, and the rise of the machine learning paradigm requires careful attention to top-down integrated systems engineering. A systems approach will be presented to stimulate out-of-the-box thinking to address the machine learning paradigm.
The document summarizes Kazuaki Ishizaki's talk on making hardware accelerators easier to use. Some key points:
- Programs are becoming simpler while hardware is becoming more complicated, with commodity processors including hardware accelerators like GPUs.
- The speaker's recent work focuses on generating hardware accelerator code from high-level programs without needing specific hardware knowledge.
- An approach using a Java JIT compiler was presented that can generate optimized GPU code from parallel Java streams, requiring programmers to only express parallelism.
- The JIT compiler performs optimizations like aligning arrays, using read-only caches, reducing data transfer, and eliminating exception checks.
- Benchmarks show the generated GPU code running substantially faster than both single-threaded and multi-threaded CPU execution.
Machine Learning under Attack: Vulnerability Exploitation and Security Measures - Pluribus One
This document summarizes research on machine learning security and adversarial attacks. It describes how machine learning systems are increasingly being used for consumer applications, but this opens them up to new security risks from skilled attackers. The document outlines different types of adversarial attacks against machine learning, including evasion attacks that aim to evade detection and poisoning attacks that aim to compromise a system's availability. It also discusses approaches for systematically evaluating the security of pattern classification systems against bounded adversaries.
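To make the evasion-attack idea concrete, here is a minimal sketch of the classic fast gradient sign method (FGSM) in PyTorch; it is a generic textbook example, not taken from the document:

```python
# Hedged sketch of an evasion attack (FGSM): nudge the input in the
# direction that increases the classifier's loss.
import torch

model = torch.nn.Linear(784, 10)          # stand-in classifier
loss_fn = torch.nn.CrossEntropyLoss()
x = torch.rand(1, 784, requires_grad=True)
y = torch.tensor([3])                     # true label

loss = loss_fn(model(x), y)
loss.backward()                           # gradient w.r.t. the input

eps = 0.1                                 # perturbation budget
x_adv = (x + eps * x.grad.sign()).clamp(0, 1)
```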
Machine Learning with Applications in Categorization, Popularity and Sequence... - Nicolas Nicolov
This document provides an overview of machine learning techniques including categorization, popularity, and sequence labeling applications. It outlines the goals of introducing important machine learning concepts and illustrating techniques through examples. The tutorial aims to be self-contained and explain notation. The outline includes examples of machine learning applications, encoding objects with features, the machine learning framework, linear models, tree models, boosting, ranking evaluation, and sequence labeling with hidden Markov models.
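Two of those topics, encoding objects with features and training a linear model, can be sketched in a few lines; the toy data and feature names here are invented for illustration:

```python
# Hedged sketch: dictionary features -> sparse matrix -> linear model.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

docs = [{"word=cheap": 1, "word=pills": 1},    # spam-like features
        {"word=meeting": 1, "word=agenda": 1}] # ham-like features
labels = [1, 0]                                # 1 = spam, 0 = ham

vec = DictVectorizer()
X = vec.fit_transform(docs)                    # sparse feature matrix
clf = LogisticRegression().fit(X, labels)
print(clf.predict(vec.transform([{"word=cheap": 1}])))
```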
Application of machine learning in industrial applications - Anish Das
The group presents an introduction to machine learning and its basics, along with industrial applications such as product categorization, improving the accuracy of inertial measurement units using supervised learning, data mining techniques, and medical diagnosis. They also discuss the future scope of machine learning.
Artificial Intelligence, Machine Learning and Deep Learning - Sujit Pal
Slides for a talk Abhishek Sharma and I gave at the Gennovation tech talks (https://gennovationtalks.com/) at Genesis. The talk was part of outreach for the Deep Learning Enthusiasts meetup group in San Francisco. My part of the talk is covered in slides 19-34.
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i... - Akihiro Hayashi
This document discusses using machine learning techniques to perform runtime selection of CPUs or GPUs for executing Java programs. It describes challenges in supporting Java features like exceptions on GPUs and in accelerating Java programs. Features such as loop characteristics, instruction counts, and memory accesses are extracted from programs to train an SVM model to predict the faster device. Evaluated on 11 apps, the model achieves 97.6-99% accuracy, using 5-fold cross-validation to avoid overfitting. This runtime selection approach can adapt to new hardware without needing to rebuild performance models.
Monte Carlo simulation is well-suited for GPU acceleration due to its highly parallel nature. GPUs provide lower cost and higher performance than CPUs for Monte Carlo applications. Numerical libraries for GPUs allow developers to focus on their models rather than reimplementing basic components. NAG has developed GPU libraries including random number generators and is working with financial institutions to apply Monte Carlo simulations to problems in finance.
The lecture discusses manycore GPU architectures and programming using OpenMP and HOMP. It introduces OpenMP directives for offloading computation to accelerators and covers data mapping between the host and device. It also discusses HOMP for automated distribution of parallel loops and data across multiple accelerators to improve load balancing and performance. The document provides examples of using OpenMP target directives and data mapping for problems like AXPY and Jacobi iteration on a GPU. It evaluates performance of different loop scheduling algorithms in HOMP on a system with CPUs, GPUs and MICs.
One of the biggest issues for a developer – whether they are an engineer at an OEM or working for a mobile AI application startup – is that their apps are at the mercy of pre-set power and performance settings as defined by OEMs or silicon vendors. So how can a developer break through that barrier when it seems their hands are tied behind their backs? The Snapdragon Power Optimization SDK allows developers to control the CPU and GPU frequency much more finely from their own application logic. This provides developers with more control within the bounds of the power/thermal framework.
How to use Apache TVM to optimize your ML models - Databricks
Apache TVM is an open source machine learning compiler that distills the largest, most powerful deep learning models into lightweight software that can run on the edge. This allows the resulting model to run inference much faster on a variety of target hardware (CPUs, GPUs, FPGAs & accelerators) and save significant costs.
In this deep dive, we’ll discuss how Apache TVM works, share the latest and upcoming features and run a live demo of how to optimize a custom machine learning model.
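A hedged sketch of what such an optimization flow can look like with TVM's Relay Python API; the model file is a placeholder, and details vary across TVM versions:

```python
# Hedged sketch: compile an ONNX model with TVM Relay and build a
# graph executor for the chosen target.
import onnx
import tvm
from tvm import relay
from tvm.contrib import graph_executor

onnx_model = onnx.load("model.onnx")             # placeholder model file
mod, params = relay.frontend.from_onnx(onnx_model)

target = "llvm"                                  # CPU target; "cuda" for GPU
with tvm.transform.PassContext(opt_level=3):     # enable aggressive passes
    lib = relay.build(mod, target=target, params=params)

dev = tvm.device(target, 0)
module = graph_executor.GraphModule(lib["default"](dev))
# module.set_input(...), module.run(), module.get_output(0) then run inference.
```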
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS - Databricks
Abstract: We will introduce RAPIDS, a suite of open source libraries for GPU-accelerated data science, and illustrate how it operates seamlessly with MLflow to enable reproducible training, model storage, and deployment. We will walk through a baseline example that incorporates MLflow locally, with a simple SQLite backend, and briefly introduce how the same workflow can be deployed in the context of GPU enabled Kubernetes clusters.
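A condensed sketch of that local workflow (a cuML estimator plus MLflow tracking against a SQLite backend); the dataset and parameters are illustrative assumptions, and the exact code in the talk may differ:

```python
# Hedged sketch: GPU-accelerated training tracked with MLflow + SQLite.
import mlflow
from cuml.ensemble import RandomForestClassifier  # GPU-accelerated estimator
from sklearn.datasets import make_classification

mlflow.set_tracking_uri("sqlite:///mlflow.db")    # simple local backend

X, y = make_classification(n_samples=10_000, n_features=20)
X, y = X.astype("float32"), y.astype("int32")     # cuML prefers 32-bit inputs

with mlflow.start_run():
    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(X, y)
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("train_accuracy", float(clf.score(X, y)))
```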
Application Optimisation using OpenPOWER and Power 9 systems - Ganesan Narayanasamy
This document discusses various ways to accelerate applications using GPUs and CUDA programming. It provides examples of using GPU-accelerated libraries, programming approaches such as OpenACC directives and CUDA, and tools like Nsight to add GPU acceleration. It also highlights many success stories, showing how applications from fields like HPC, deep learning, and computational chemistry have achieved speedups using these techniques. Resources and compilers are available to help developers get started with GPU programming.
In this deck from the UK HPC Conference, Gunter Roeth from NVIDIA presents: Hardware & Software Platforms for HPC, AI and ML.
"Data is driving the transformation of industries around the world and a new generation of AI applications are effectively becoming programs that write software, powered by data, vs by computer programmers. Today, NVIDIA’s tensor core GPU sits at the core of most AI, ML and HPC applications, and NVIDIA software surrounds every level of such a modern application, from CUDA and libraries like cuDNN and NCCL embedded in every deep learning framework and optimized and delivered via the NVIDIA GPU Cloud to reference architectures designed to streamline the deployment of large scale infrastructures."
Watch the video: https://wp.me/p3RLHQ-l2Y
Learn more: http://nvidia.com and http://hpcadvisorycouncil.com/events/2019/uk-conference/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Snap ML is a machine learning framework for fast training of generalized linear models (GLMs) that can scale to large datasets. It uses multi-level parallelism across nodes and GPUs. Snap ML implementations include snap-ml-local for single nodes, snap-ml-mpi for multi-node HPC environments, and snap-ml-spark for Apache Spark clusters. Experimental results show Snap ML can train a logistic regression model on a 3TB Criteo dataset within 1.5 minutes using 16 GPUs.
Backend.AI Technical Introduction (19.09 / 2019 Autumn) - Lablup Inc.
This slide introduces technical specs and details about Backend.AI 19.09.
* On-premise clustering / container orchestration / scaling on cloud
* Container-level fractional GPU technology that lets many containers share one GPU at the same time, each seeing it as its own GPU.
* NVIDIA GPU Cloud integrations
* Enterprise features
Labview1_ Computer Applications in Control_ACRRL - Mohammad Sabouri
Computer Applications in Control
ACRRL
Applied Control & Robotics Research Laboratory of Shiraz University
Department of Power and Control Engineering, Shiraz University, Fars, Iran.
Instructor: Dr. Asemani
TA: Mohammad Sabouri
https://sites.google.com/view/acrrl/
This document provides an update on PGI compilers and tools for heterogeneous supercomputing. It discusses PGI's support for OpenACC directives to accelerate applications on multicore CPUs and NVIDIA GPUs from a single source. It highlights new compiler features including support for Intel Skylake, AMD EPYC and IBM POWER9 CPUs as well as NVIDIA Volta GPUs. Benchmark results show strong performance of OpenACC applications on these platforms. The document also discusses the growing adoption of OpenACC in HPC applications and resources available to support OpenACC development.
Parallel Application Performance Prediction Using Analysis Based Modeling - Jason Liu
Parallel Application Performance Prediction Using Analysis Based Models and HPC Simulations, Mohammad Abu Obaida, Jason Liu, Gopinath Chennupati, Nandakishore Santhi, and Stephan Eidenbenz. 2018 SIGSIM Principles of Advanced Discrete Simulation (SIGSIM-PADS’18), May 2018.
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~ - Kohei KaiGai
GPU processing provides significant performance gains for PostgreSQL according to benchmarks. PG-Strom is an open source project that allows PostgreSQL to leverage GPUs for processing queries. It generates CUDA code from SQL queries to accelerate operations like scans, joins, and aggregations by massive parallel processing on GPU cores. Performance tests show orders of magnitude faster response times for queries involving multiple joins and aggregations when using PG-Strom compared to the regular PostgreSQL query executor. Further development aims to support more data types and functions for GPU processing.
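As a hedged illustration of how this looks from Python, the sketch below enables PG-Strom for a session and inspects a query plan; `pg_strom.enabled` is PG-Strom's documented on/off setting, but verify it against your installed version, and the table names are placeholders.

```python
# Hedged sketch: run EXPLAIN against a PG-Strom-enabled PostgreSQL.
import psycopg2

conn = psycopg2.connect("dbname=test")      # placeholder connection string
cur = conn.cursor()
cur.execute("SET pg_strom.enabled = on;")   # assumed PG-Strom GUC
cur.execute("""
    EXPLAIN SELECT t1.cat, count(*), avg(t1.x)
    FROM t1 JOIN t2 USING (id)
    GROUP BY t1.cat;
""")
for (line,) in cur.fetchall():              # plan should show Gpu* nodes
    print(line)
```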
The document discusses optimizing parallelism in NumPy-based programs, walking through profiling and optimizing a main function from 50.1 ms down to 2.83 ms. It discusses approaches for performant numerical code, including vectorization and Python compilers, and covers oversubscription issues that arise when all CPU cores are used alongside the parallel APIs in NumPy, SciPy, and scikit-learn. It provides recommendations for tuning default parallel behavior and controlling parallelism in these packages.
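Two of those recommendations, vectorizing hot loops and capping thread pools to avoid oversubscription, can be sketched as follows (a generic example, not from the slides):

```python
# Hedged sketch: vectorization plus thread-pool limiting.
import numpy as np
from threadpoolctl import threadpool_limits

x = np.random.rand(1_000_000)

# Vectorized expression: one pass in optimized native code, no Python loop.
y = np.sqrt(x) * 2.0 + 1.0

# Cap BLAS/OpenMP worker threads, e.g., when processes are also parallel.
with threadpool_limits(limits=4):
    a = np.random.rand(1000, 1000)
    b = a @ a  # this matrix multiply uses at most 4 threads
print(y[:3], b.shape)
```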
JVM and OS Tuning for accelerating Spark application - Tatsuhiro Chiba
1) The document discusses optimizing Spark applications through JVM and OS tuning. Tuning aspects covered include JVM heap sizing, garbage collection options, process affinity, and large memory pages.
2) Benchmark results show that after applying these optimizations, execution time was reduced by 30-50% for Kmeans clustering and TPC-H queries compared to the default configuration.
3) Dividing the application across multiple smaller JVMs instead of a single large JVM helped reduce garbage collection overhead and resource contention, improving performance by up to 16%.
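For illustration, the sketch below shows how such JVM options can be applied from PySpark; the specific flag values are assumptions for demonstration, not the benchmarked settings from the talk:

```python
# Hedged sketch: applying JVM-tuning ideas via PySpark configuration.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("jvm-tuning-example")
         # Several smaller executors instead of one large JVM can reduce
         # GC overhead and resource contention (point 3 above).
         .config("spark.executor.instances", "4")
         .config("spark.executor.memory", "8g")
         # Illustrative GC/heap flags; tune per workload.
         .config("spark.executor.extraJavaOptions",
                 "-XX:+UseParallelGC -Xmn2g -XX:+AlwaysPreTouch")
         .getOrCreate())
```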
Fugaku, the Successes and the Lessons Learned - RCCSRENKEI
The document summarizes the successes and lessons learned from Fugaku, Japan's flagship supercomputer. Key points include:
- Fugaku achieved the top performance on all HPC benchmarks in 2020 and 2021, showing high performance across applications, not just traditional HPC workloads.
- While many applications achieved their target performance, some did not due to issues like insufficient parallelism, I/O scalability problems, and compiler vectorization failures.
- Lessons include the need for improved software stacks, application analysis, and adapting to modern applications beyond classic HPC.
- Looking ahead, sustained exascale performance will require data-centric architectures and corresponding system software and algorithms as transistor scaling slows.
Similar to Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
Chapel-on-X: Exploring Tasking Runtimes for PGAS Languages - Akihiro Hayashi
With the shift to exascale computer systems, the importance of productive programming models for distributed systems is increasing. Partitioned Global Address Space (PGAS) programming models aim to reduce the complexity of writing distributed-memory parallel programs by introducing global operations on distributed arrays, distributed task parallelism, directed synchronization, and mutual exclusion. However, a key challenge in the application of PGAS programming models is the improvement of compilers and runtime systems. In particular, one open question is how runtime systems meet the requirement of exascale systems, where a large number of asynchronous tasks are executed.
While there are various tasking runtimes such as Qthreads, OCR, and HClib, there is no existing comparative study on PGAS tasking/threading runtime systems. To explore runtime systems for PGAS programming languages, we have implemented OCR-based and HClib-based Chapel runtimes and evaluated them with an initial focus on tasking and synchronization implementations. The results show that our OCR- and HClib-based implementations can improve the performance of PGAS programs compared to the existing Qthreads backend of Chapel.
Polyhedral compilation uses the polyhedral model to represent programs as systems of affine inequalities over iteration variables. This allows loop transformations like fusion, distribution, skewing and reversal to be expressed as affine mappings on the iteration space. The key aspects are representing the iteration domain, scheduling functions that determine the execution order of statements, and memory accesses in terms of iteration vectors. Loop transformations are specified by changing the scheduling functions to map iterations to new logical execution times while preserving semantics. This enables optimizations at the level of whole programs or subprograms.
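As a worked micro-example of these ideas (not drawn from the summary itself), loop interchange on a two-level nest can be written as an affine re-scheduling of the iteration domain:

```latex
% Iteration domain of statement S(i, j) in a doubly nested loop:
D_S = \{ (i, j) \in \mathbb{Z}^2 \mid 0 \le i < N,\ 0 \le j < M \}

% Original schedule: execute iterations in lexicographic (i, j) order.
\theta_S(i, j) = (i, j)

% Loop interchange is the affine schedule change
\theta'_S(i, j) = (j, i)

% which preserves semantics iff every dependence (i, j) \to (i', j')
% still executes source before sink under the new order.
```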
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator... - Akihiro Hayashi
Third Workshop on Accelerator Programming Using Directives (WACCPD2016, co-located with SC16)
While GPUs are increasingly popular for high-performance computing, optimizing the performance of GPU programs is a time-consuming and non-trivial process in general. This complexity stems from the low abstraction level of standard GPU programming models such as CUDA and OpenCL: programmers are required to orchestrate low-level operations in order to exploit the full capability of GPUs. In terms of software productivity and portability, a more attractive approach would be to facilitate GPU programming by providing high-level abstractions for expressing parallel algorithms.
OpenMP is a directive-based shared memory parallel programming model and has been widely used for many years. From OpenMP 4.0 onwards, GPU platforms are supported by extending OpenMP's high-level parallel abstractions with accelerator programming. This extension allows programmers to write GPU programs in standard C/C++ or Fortran languages, without exposing too many details of GPU architectures. However, such high-level parallel programming strategies generally impose additional program optimizations on compilers, which could result in lower performance than fully hand-tuned code with low-level programming models. To study potential performance improvements by compiling and optimizing high-level GPU programs, in this paper, we 1) evaluate a set of OpenMP 4.x benchmarks on an IBM POWER8 and NVIDIA Tesla GPU platform and 2) conduct a comparable performance analysis among hand-written CUDA and automatically-generated GPU programs by the IBM XL and clang/LLVM compilers.
LLVM-based Communication Optimizations for PGAS Programs - Akihiro Hayashi
The Second Workshop on the LLVM Compiler Infrastructure in HPC (Co-located with SC15)
While Partitioned Global Address Space (PGAS) programming languages such as UPC/UPC++, CAF, Chapel and X10 provide high-level programming models for facilitating large-scale distributed memory parallel programming, it is widely recognized that compiler analysis and optimization for these languages has been very limited, unlike the optimization of SMP models such as OpenMP. One reason for this limitation is that current optimizers for PGAS programs are specialized to different languages. This is unfortunate since communication optimization is an important class of compiler optimizations for PGAS programs running on distributed memory platforms, and these optimizations need to be performed more widely. Thus, a more effective approach would be to build a language-independent and runtime-independent compiler framework for optimizing PGAS programs so that new communication optimizations can be leveraged by different languages. To address this need, we introduce an LLVM-based (Low Level Virtual Machine) communication optimization framework. Our compilation system leverages existing optimization passes and introduces new PGAS language-aware runtime dependent/independent passes to reduce communication overheads. Our experimental results show an average performance improvement of 3.5× and 3.4× on 64 nodes of a Cray XC30 supercomputer and 32 nodes of a Westmere cluster respectively, for a set of benchmarks written in the Chapel language. Overall, we show that our new LLVM-based compiler optimization framework can effectively improve the performance of PGAS programs.
Studies on Automatic Parallelization for Heterogeneous and Homogeneous Multi... - Akihiro Hayashi
This document discusses research on automatic parallelization for heterogeneous and homogeneous multicore processors. It presents Akihiro Hayashi's PhD defense at Waseda University on this topic. It motivates the need for automatic parallelization due to difficulties in programming multicore processors. It proposes a solution called OSCAR that uses a heterogeneous multicore compiler with APIs to enable automatic parallelization across different processor types. The methodology involves hint directives, parallelization of tasks, power reduction techniques, and generation of executables. It evaluates the approach on media applications using a Renesas multicore processor.
LLVM Optimizations for PGAS Programs -Case Study: LLVM Wide Optimization in C... - Akihiro Hayashi
This document summarizes research on LLVM optimizations for PGAS (Partitioned Global Address Space) programs like Chapel. It discusses generating LLVM IR from Chapel to enable optimizations like LICM (Loop Invariant Code Motion). Evaluations show LLVM optimizations remove many communication operations and improve performance for some applications vs. C code generation. However, LLVM constraints and wide pointer overhead hurt performance for other applications. Future work includes more applications, possibly-remote to definitely-local transformations, and parallel intermediate representations in LLVM.
Speculative Execution of Parallel Programs with Precise Exception Semantics ... - Akihiro Hayashi
Akihiro Hayashi, Max Grossman, Jisheng Zhao, Jun Shirako, Vivek Sarkar. The 26th International Workshop on Languages and Compilers for Parallel Computing (LCPC2013), September 25-27, 2013 Qualcomm Research Silicon Valley, Santa Clara, CA (co-located with CnC-2013).
Accelerating Habanero-Java Program with OpenCL Generation - Akihiro Hayashi
Accelerating Habanero-Java Program with OpenCL Generation. Akihiro Hayashi, Max Grossman, Jisheng Zhao, Jun Shirako, Vivek Sarkar. 10th International Conference on the Principles and Practice of Programming in Java (PPPJ), September 2013.
HCL Notes and Domino License Cost Reduction in the World of DLAU - panagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU and licensing under the CCB and CCX models have been a hot topic for many in the HCL community since last year. As a Notes or Domino customer, you may be struggling with unexpectedly high user counts and license fees. You may be wondering how this new kind of licensing works and what benefits it brings you. Above all, you certainly want to stay within your budget and save costs wherever possible. We understand that, and we want to help!
We explain how to solve common configuration problems that can cause more users to be counted than necessary, and how to identify and remove superfluous or unused accounts to save money. There are also some practices that can lead to unnecessary expenses, for example using a person document instead of a mail-in database for shared mailboxes. We show you such cases and their solutions. And of course we explain the new license model.
Join this webinar, in which HCL Ambassador Marc Thomas and guest speaker Franz Walder introduce you to this new world. It gives you the tools and know-how to keep track of everything. You will be able to reduce your costs through an optimized Domino configuration and keep them low in the future.
These topics will be covered:
- Reducing license costs by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how best to use it
- Tips for common problem areas, such as team mailboxes, functional/test users, etc.
- Real-world examples and best practices to implement right away
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024 - Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf - Paige Cruz
Monitoring and observability aren't traditionally found in software curriculums, and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is part of our current company's observability stack.
While the dev and ops silo continues to crumble, many organizations still relegate monitoring & observability to ops, infra, and SRE teams. This is a mistake: achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share these foundational concepts to build on:
TrustArc Webinar - 2024 Global Privacy Survey - TrustArc
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack | shyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Best 20 SEO Techniques To Improve Website Visibility In SERP | Pixlogix Infotech
Boost your website's visibility with proven SEO techniques! Our latest blog dives into essential strategies to enhance your online presence, increase traffic, and rank higher on search engines. From keyword optimization to quality content creation, learn how to make your site stand out in the crowded digital landscape. Discover actionable tips and expert insights to elevate your SEO game.
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices want to take full advantage of the features available on those devices, but many of those features provide convenience and capability at the expense of security. This best practices guide outlines steps users can take to better protect personal devices and information.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor... | SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
“An Outlook of the Ongoing and Future Relationship between Blockchain Technologies and Process-aware Information Systems.” Invited talk at the joint workshop on Blockchain for Information Systems (BC4IS) and Blockchain for Trusted Data Sharing (B4TDS), co-located with the 36th International Conference on Advanced Information Systems Engineering (CAiSE), 3 June 2024, Limassol, Cyprus.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor... | Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
How to Get CNIC Information System with Paksim Ga.pptx | danishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
Removing Uninteresting Bytes in Software Fuzzing | Aftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speed up fuzzing campaigns by pinpointing and eliminating uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security-analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are the slides of the talk given at the IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW 2022).
Pushing the limits of ePRTC: 100ns holdover for 100 days | Adtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
1. Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
Akihiro Hayashi (Rice University), Kazuaki Ishizaki (IBM Research - Tokyo), Gita Koblents (IBM Canada), Vivek Sarkar (Rice University)
ACM International Conference on Principles and Practices of Programming on the Java Platform: Virtual Machines, Languages, and Tools (PPPJ’15)
3. Background: Explicit Parallelism with Java
High-level parallel programming with Java offers opportunities for:
- preserving portability
- enabling the compiler to perform parallel-aware optimizations and code generation
[Figure: Java 8 programs (SW) are mapped through the Java 8 Parallel Stream API onto multi-core CPUs, many-core GPUs, and FPGAs (HW).]
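For readers unfamiliar with the API, here is a minimal sketch (not from the original slides) of the kind of parallel-stream kernel the compiler targets; the array names are illustrative:

```java
import java.util.stream.IntStream;

public class VecAddExample {
    public static void main(String[] args) {
        int n = 4_000_000;
        float[] a = new float[n], b = new float[n], c = new float[n];
        // One high-level construct; the runtime may map it to fork/join
        // worker threads on a multi-core CPU or, with a GPU-enabled JIT,
        // to a GPU kernel.
        IntStream.range(0, n).parallel()
                 .forEach(i -> c[i] = a[i] + b[i]);
    }
}
```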
4. Background: JIT Compilation for GPU Execution
IBM Java 8 Compiler:
- Built on top of the production version of the IBM Java 8 runtime environment
[Figure: a method is interpreted on the JVM for its first N invocations; at the (N+1)th invocation, the JIT generates native code for multi-core CPUs or many-core GPUs.]
5. Background: The Compilation Flow of the IBM Java 8 Compiler
[Figure: compilation flow. Existing path: Java bytecode → bytecode translation → our IR → identification of parallel streams in IR → analysis and existing optimizations → target machine code generation → PowerPC native code. Our new modules for GPU: IR for parallel streams → NVVM IR generation → NVVM IR → libnvvm → PTX → PTX2binary module (JIT compiler module by NVIDIA) → NVIDIA GPU native code, plus runtime helpers and GPU feature extraction.]
Optimizations for GPUs:
- To improve performance:
  - Read-only cache utilization / buffer alignment
  - Data transfer optimizations
- To support Java’s language features:
  - Loop versioning for eliminating redundant exception checking on GPUs
  - Virtual method invocation support with de-virtualization and loop versioning (see the sketch below)
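As a hedged illustration of the loop-versioning idea (the compiler applies it on its IR, not on source code; this hand-written Java sketch only conveys the shape of the transformation):

```java
public class LoopVersioning {
    // Conceptual sketch of loop versioning for exception-check elimination.
    // If the guard proves all accesses in-bounds and non-null up front, the
    // check-free version can run (e.g., as a GPU kernel); otherwise the
    // original version runs with full Java exception semantics.
    static void scale(float[] dst, float[] src, int n) {
        if (dst != null && src != null && dst.length >= n && src.length >= n) {
            for (int i = 0; i < n; i++) {
                dst[i] = 2.0f * src[i]; // fast version: no per-iteration checks needed
            }
        } else {
            for (int i = 0; i < n; i++) {
                dst[i] = 2.0f * src[i]; // safe version: JVM checks raise exceptions as usual
            }
        }
    }
}
```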
6. Motivation: Runtime CPU/GPU Selection
Selecting the faster hardware device is a challenging problem.
[Figure: at the (N+1)th invocation of a method, native code must be generated for either multi-core CPUs or many-core GPUs. Problem: which one is faster?]
7. Related Work: Linear Regression
Regression-based cost estimation [1,2] is specific to an application.
[Figure: kernel execution time (msec) vs. the dynamic number of IR instructions on the NVIDIA Tesla K40 GPU and the IBM POWER8 CPU, for App 1 (BlackScholes) and App 2 (Vector Addition); which device is faster differs between the two applications.]
[1] Leung et al. Automatic Parallelization for Graphics Processing Units (PPPJ’09)
[2] Kerr et al. Modeling GPU-CPU Workloads and Systems (GPGPU-3)
8. Open Question: Is an Accurate Cost Model Required?
Constructing an accurate cost model would be too costly: considerable effort would be needed to update performance models for future generations of hardware.
[Figure: multi-core CPUs vs. many-core GPUs; which one is faster?]
→ Machine-learning-based performance heuristics
9. Our Approach: ML-based Performance Heuristics
A binary prediction model is constructed by supervised machine learning with support vector machines (SVMs).
[Figure: during training runs, the JIT compiler extracts features (feature 1, 2, 3) from the bytecode of App A and App B with different data sets (data 1, 2, 3); the Java runtime labels each sample CPU or GPU, and LIBSVM constructs the prediction model offline.]
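The slides name LIBSVM as the training tool. As a hedged sketch of what the offline model-construction step might look like with LIBSVM’s Java bindings (the feature values, labels, and parameter settings below are illustrative, not the paper’s):

```java
import libsvm.*;

public class TrainHeuristic {
    // Convert a dense feature vector (e.g., loop range, #instructions,
    // #array accesses) into LIBSVM's sparse node representation.
    static svm_node[] toNodes(double[] features) {
        svm_node[] nodes = new svm_node[features.length];
        for (int i = 0; i < features.length; i++) {
            nodes[i] = new svm_node();
            nodes[i].index = i + 1;   // LIBSVM feature indices are 1-based
            nodes[i].value = features[i];
        }
        return nodes;
    }

    public static void main(String[] args) {
        // Made-up training samples: +1 = CPU faster, -1 = GPU faster.
        double[][] x = { {1_024, 3e5, 2e5}, {4_000_000, 9e7, 6e7} };
        double[]   y = { +1, -1 };

        svm_problem prob = new svm_problem();
        prob.l = x.length;
        prob.y = y;
        prob.x = new svm_node[prob.l][];
        for (int i = 0; i < prob.l; i++) prob.x[i] = toNodes(x[i]);

        svm_parameter param = new svm_parameter();
        param.svm_type    = svm_parameter.C_SVC; // binary classification
        param.kernel_type = svm_parameter.RBF;
        param.C = 1;  param.gamma = 0.5;
        param.cache_size = 100;  param.eps = 1e-3;

        svm_model model = svm.svm_train(prob, param);
        double[] unseen = {2_048, 5e5, 3e5};
        System.out.println(svm.svm_predict(model, toNodes(unseen)) > 0 ? "CPU" : "GPU");
    }
}
```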
10. Features of Program That May Affect Performance
- Loop range (parallel loop size)
- The dynamic number of instructions:
  - Memory accesses
  - Arithmetic operations
  - Math methods
  - Branch instructions
  - Other instructions
11. Features of Program That May Affect Performance (Cont’d)
- The dynamic number of array accesses:
  - Coalesced access a[i] (aligned access)
  - Offset access a[i+c]
  - Stride access a[c*i]
  - Other access a[b[i]]
- Data transfer size:
  - H2D transfer size
  - D2H transfer size
[Figure: measured bandwidth (GB/s) as a function of c, the offset or stride size, for the offset (array[i+c]) and stride (array[c*i]) access patterns.]
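For reference, the four access categories look like this in Java source; this is an illustrative sketch, with c and b arbitrary:

```java
public class AccessPatterns {
    // Illustrative array-access patterns counted by the feature extractor.
    // Assumes i, i + c, c * i, and b[i] are all valid indices into a.
    static float touch(float[] a, int[] b, int c, int i) {
        float coalesced = a[i];     // a[i]   : coalesced (aligned) access
        float offset    = a[i + c]; // a[i+c] : offset access
        float strided   = a[c * i]; // a[c*i] : stride access
        float indirect  = a[b[i]];  // a[b[i]]: other (indirect) access
        return coalesced + offset + strided + indirect;
    }
}
```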
13. Applications

| Application | Source | Field | Max Size | Data Type |
|---|---|---|---|---|
| BlackScholes | | Finance | 4,194,304 | double |
| Crypt | JGF | Cryptography | Size C (N=50M) | byte |
| SpMM | JGF | Numerical Computing | Size C (N=500K) | double |
| MRIQ | Parboil | Medical | Large (64^3) | float |
| Gemm | Polybench | Numerical Computing | 2K x 2K | int |
| Gesummv | Polybench | | 2K x 2K | int |
| Doitgen | Polybench | | 256x256x256 | int |
| Jacobi-1D | Polybench | | N=4M, T=1 | int |
| Matrix Multiplication | | | 2K x 2K | double |
| Matrix Transpose | | | 2K x 2K | double |
| VecAdd | | | 4M | double |
14. Platform
CPU: IBM POWER8 @ 3.69GHz, 20 cores, 8 SMT threads per core (up to 160 threads), 256 GB of RAM
GPU: NVIDIA Tesla K40m, 12GB of global memory
15. Prediction Model Construction
- Obtained 291 samples by running 11 applications with different data sets
- The choice is either the GPU or 160 worker threads on the CPU
[Figure: the same training flow as in Slide 9: features are extracted from App A and App B during training runs with the JIT compiler, and the prediction model is built offline with LIBSVM 3.2.]
16. Speedups and the Accuracy with the Max Data Size: 160 Worker Threads vs. GPU
[Chart: speedup relative to sequential Java (log scale; higher is better) for each of the 11 applications. 160 worker threads (fork/join): 40.6, 37.4, 82.0, 64.2, 27.6, 1.4, 1.0, 4.4, 36.7, 7.4, 5.7. GPU: 42.7, 34.6, 58.1, 844.7, 772.3, 1.0, 0.1, 1.9, 1164.8, 9.0, 1.2. The prediction is correct (✔) for every application except the first (x).]
17. How to Evaluate Prediction Models: TRUE/FALSE POSITIVE/NEGATIVE
There are 4 types of binary prediction results (let POSITIVE be CPU and NEGATIVE be GPU):
- TRUE POSITIVE: correctly predicted that the CPU is faster
- TRUE NEGATIVE: correctly predicted that the GPU is faster
- FALSE POSITIVE: predicted the CPU is faster, but the GPU is actually faster
- FALSE NEGATIVE: predicted the GPU is faster, but the CPU is actually faster
18. How to Evaluate Prediction Models: Accuracy, Precision, and Recall Metrics
- Accuracy: the percentage of selections predicted correctly: (TP + TN) / (TP + TN + FP + FN)
- Precision X (X = CPU or GPU): how precise the model is when it predicts that X is faster
  - Precision CPU: TP / (TP + FP), i.e., # of samples correctly predicted CPU is faster / total # of samples predicted CPU is faster
  - Precision GPU: TN / (TN + FN)
- Recall X (X = CPU or GPU): how often the prediction is correct when X is actually faster
  - Recall CPU: TP / (TP + FN), i.e., # of samples correctly predicted CPU is faster / total # of samples where CPU is actually faster
  - Recall GPU: TN / (TN + FP)
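These definitions translate directly into code; a minimal sketch (the names are ours, not the paper’s):

```java
// Confusion-matrix metrics with POSITIVE = CPU, NEGATIVE = GPU.
class Confusion {
    final int tp, tn, fp, fn;
    Confusion(int tp, int tn, int fp, int fn) {
        this.tp = tp; this.tn = tn; this.fp = fp; this.fn = fn;
    }
    double accuracy()     { return (double) (tp + tn) / (tp + tn + fp + fn); }
    double precisionCpu() { return (double) tp / (tp + fp); }
    double precisionGpu() { return (double) tn / (tn + fn); }
    double recallCpu()    { return (double) tp / (tp + fn); }
    double recallGpu()    { return (double) tn / (tn + fp); }
}
```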
19. How to Evaluate Prediction Models: 5-Fold Cross Validation
- Overfitting problem: the prediction model may be tailored to the eleven applications if training data = testing data
- To avoid overfitting: calculate the accuracy of the prediction model using 5-fold cross validation
[Figure: the 291 training samples are split into Subsets 1-5. In each round, one subset is held out for testing and the model is trained on the other four (e.g., train on Subsets 2-5 and test on Subset 1; train on Subsets 1 and 3-5 and test on Subset 2), reporting accuracy, precision, etc. per round.]
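A hedged sketch of the 5-fold protocol (trainAndScore is a placeholder for LIBSVM training plus evaluation; the data loading is elided):

```java
import java.util.*;

public class FiveFoldCV {
    // Placeholder: train a model on `train` and return its accuracy on `test`.
    static double trainAndScore(List<double[]> train, List<double[]> test) {
        return 0.0;
    }

    public static void main(String[] args) {
        List<double[]> samples = new ArrayList<>(); // e.g., 291 labeled feature vectors
        Collections.shuffle(samples, new Random(42));
        final int k = 5;
        double total = 0;
        for (int fold = 0; fold < k; fold++) {
            List<double[]> train = new ArrayList<>(), test = new ArrayList<>();
            for (int i = 0; i < samples.size(); i++) {
                (i % k == fold ? test : train).add(samples.get(i)); // hold out one fold
            }
            total += trainAndScore(train, test);
        }
        System.out.printf("mean accuracy over %d folds: %.2f%%%n", k, 100.0 * total / k);
    }
}
```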
20. Accuracies with Cross-Validation: 160 Worker Threads or GPU
[Chart: accuracy (%) over 291 samples (higher is better) as feature groups are added cumulatively:]
- Range: 79.0%
- +=nIRs: 97.6%
- +=dIRs: 99.0%
- +=Array: 99.0%
- ALL (+=DT): 97.2%
21. Precisions and Recalls with Cross-Validation (higher is better)

| Features | Precision CPU | Recall CPU | Precision GPU | Recall GPU |
|---|---|---|---|---|
| Range | 79.0% | 100% | 0% | 0% |
| +=nIRs | 97.8% | 99.1% | 96.5% | 91.8% |
| +=dIRs | 98.7% | 100% | 100% | 95.0% |
| +=Arrays | 98.7% | 100% | 100% | 95.0% |
| ALL | 96.7% | 100% | 100% | 86.9% |

All prediction models except Range rarely make a bad decision.
22. Discussion
- Based on the results with 291 samples, the feature set (range, # of detailed instructions, # of array accesses) shows the best accuracy
  - DT (data transfer) does not contribute to improving the accuracy, since the DT optimizations do not make GPUs faster
  - Loop range, # of arithmetic operations, and # of coalesced accesses affect the decision most
- Pros and cons:
  - (+) Future generations of hardware can be supported easily by re-running the applications
  - (+) Adding further training data and rebuilding the prediction model is straightforward
  - (-) Collecting training data takes time
23. Related Work: Java + GPU

| | Lang | JIT | GPU Kernel | Device Selection |
|---|---|---|---|---|
| JCUDA | Java | - | CUDA | GPU only |
| Lime | Lime | ✔ | Override map/reduce | Static |
| Firepile | Scala | ✔ | reduce | Static |
| JaBEE | Java | ✔ | Override run | GPU only |
| Aparapi | Java | ✔ | map | Static |
| Hadoop-CL | Java | ✔ | Override map/reduce | Static |
| RootBeer | Java | ✔ | Override run | Not described |
| HJ-OpenCL | HJ | - | forall | Static |
| PPPJ09 (auto) | Java | ✔ | For-loop | Dynamic with regression |
| Our work | Java | ✔ | Parallel Stream | Dynamic with machine learning |

None of these approaches considers Java 8 Parallel Stream APIs together with dynamic device selection based on machine learning.
24. Conclusions
- Machine-learning-based performance heuristics:
  - Up to 99% accuracy
  - A promising way to build performance heuristics
- Future work:
  - Exploration of additional program features (e.g., CFG)
  - Selection of the best configuration from 1 worker, 2 workers, ..., 160 workers, GPU
  - Parallelizing the training phase
- For more details on GPU code generation, see “Compiling and Optimizing Java 8 Programs for GPU Execution” (PACT15, October 2015)