Fourth Workshop on Accelerator Programming Using Directives (WACCPD2017, co-located with SC17)
While multi-core CPUs and many-core GPUs are both viable platforms for parallel computing, their programming models can impose a large burden on programmers due to complex, low-level APIs. Since managed languages like Java are designed to run on multiple platforms, parallel language constructs and APIs such as the Java 8 parallel stream APIs can enable high-level parallel programming with the promise of performance portability for mainstream ("non-ninja") programmers. To achieve this goal, the selection of the hardware device should be automated rather than specified by the programmer, as it is in current programming models. Because many factors affect performance, predicting the faster device for an individual kernel remains a difficult problem. While a prior approach uses machine learning to address this challenge, there has been no comparative study of which supervised machine learning algorithms and which program features work best. In this paper, we explore 1) program features to be extracted by a compiler and 2) various machine learning techniques that improve prediction accuracy, thereby improving performance. The results show that an appropriate selection of program features and machine learning algorithm can further improve accuracy. In particular, support vector machines (SVMs), logistic regression, and the J48 decision tree prove to be reliable techniques for building accurate prediction models from just two, three, or four program features, achieving accuracies of 99.66%, 98.63%, and 98.28%, respectively, under 5-fold cross-validation.
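To make the training setup concrete, here is a minimal sketch in Java using the Weka library, which provides all three classifiers named above (SMO is Weka's SVM trainer). The ARFF file name and its feature columns are assumptions for illustration, not the paper's actual dataset:

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.Logistic;
import weka.classifiers.functions.SMO;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class DevicePredictionCV {
    public static void main(String[] args) throws Exception {
        // Each instance holds compiler-extracted kernel features plus a
        // {CPU, GPU} class label; "kernel_features.arff" is a placeholder.
        Instances data = DataSource.read("kernel_features.arff");
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] models = { new SMO(), new Logistic(), new J48() };
        for (Classifier model : models) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(model, data, 5, new Random(1)); // 5-fold CV
            System.out.printf("%-10s accuracy: %.2f%%%n",
                    model.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}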
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i... – Akihiro Hayashi
This document discusses using machine learning techniques to perform runtime selection of CPUs or GPUs for executing Java programs. It describes the challenges of supporting Java features like exceptions on GPUs and of accelerating Java programs in general. Features such as loop characteristics, instruction counts, and memory accesses are extracted from programs to train an SVM model that predicts the faster device. Evaluated on 11 applications, the model achieves 97.6-99% accuracy, using 5-fold cross-validation to avoid overfitting. This runtime selection approach can adapt to new hardware without needing to rebuild performance models.
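A minimal sketch of the runtime-selection step itself, assuming a Weka model trained offline as above; the feature names and the {CPU, GPU} class labels are illustrative, not the actual implementation from the slides:

import java.util.ArrayList;
import weka.classifiers.Classifier;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instance;
import weka.core.Instances;

public class RuntimeDeviceSelector {
    private final Classifier model;   // trained offline, e.g. Weka's SMO (SVM)
    private final Instances header;   // attribute metadata matching training data

    public RuntimeDeviceSelector(Classifier model) {
        this.model = model;
        ArrayList<Attribute> attrs = new ArrayList<>();
        attrs.add(new Attribute("loopRange"));        // illustrative feature names
        attrs.add(new Attribute("instructionCount"));
        attrs.add(new Attribute("memoryAccesses"));
        ArrayList<String> labels = new ArrayList<>();
        labels.add("CPU");
        labels.add("GPU");
        attrs.add(new Attribute("device", labels));
        header = new Instances("kernel", attrs, 0);
        header.setClassIndex(header.numAttributes() - 1);
    }

    // Returns true if the model predicts the GPU will run this kernel faster.
    public boolean preferGpu(double loopRange, double instrCount, double memAccesses)
            throws Exception {
        Instance inst = new DenseInstance(header.numAttributes());
        inst.setDataset(header);
        inst.setValue(0, loopRange);
        inst.setValue(1, instrCount);
        inst.setValue(2, memAccesses);
        return header.classAttribute()
                     .value((int) model.classifyInstance(inst))
                     .equals("GPU");
    }
}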
For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/altera/embedded-vision-training/videos/pages/may-2015-embedded-vision-summit
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Deshanand Singh, Director of Software Engineering at Altera, presents the "Efficient Implementation of Convolutional Neural Networks using OpenCL on FPGAs" tutorial at the May 2015 Embedded Vision Summit.
Convolutional neural networks (CNNs) are becoming increasingly popular in embedded applications such as vision processing and automotive driver assistance systems. The structure of CNN systems is characterized by cascades of FIR filters and transcendental functions. FPGA technology offers a very efficient way of implementing these structures by allowing designers to build custom hardware datapaths that implement the CNN structure. One challenge of using FPGAs is a design flow that has traditionally centered on tedious hardware description languages.
In this talk, Deshanand gives a detailed explanation of how CNN algorithms can be expressed in OpenCL and compiled directly to FPGA hardware. He details the code optimizations involved and compares the efficiency of the result with hand-coded implementations.
For the full video of this presentation, please visit:
https://www.edge-ai-vision.com/2021/02/introduction-to-the-tvm-open-source-deep-learning-compiler-stack-a-presentation-from-octoml/
Luis Ceze, Co-founder and CEO of OctoML, a Professor in the Paul G. Allen School of Computer Science and Engineering at the University of Washington, and Venture Partner at Madrona Venture Group, presents the “Introduction to the TVM Open Source Deep Learning Compiler Stack” tutorial at the September 2020 Embedded Vision Summit.
There is an increasing need to bring machine learning to a wide diversity of hardware devices. Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Deploying workloads to new platforms — such as mobile phones, embedded devices, and accelerators — requires significant manual effort.
In this talk, Ceze presents his work on the TVM stack, which exposes graph- and operator-level optimizations to provide performance portability for deep learning workloads across diverse hardware back-ends. TVM solves optimization challenges specific to deep learning, such as high-level operator fusion, mapping to arbitrary hardware primitives, and memory latency hiding. It also automates the optimization of low-level programs to match hardware characteristics by employing a novel, learning-based cost modeling method for rapid exploration of optimizations.
This document summarizes Kazuaki Ishizaki's keynote presentation at the Fourth International Symposium on Computing and Networking (CANDAR'16) on transparent GPU exploitation for Java. The presentation covered Ishizaki's research history developing compilers and optimizing code for GPUs. It described a Java just-in-time compiler that can generate optimized GPU code from parallel loops in Java programs without requiring programmers to manage low-level GPU operations like data transfers and memory allocation themselves. The compiler implements optimizations like array alignment, read-only caching, and reducing data copying to improve GPU performance. The goal is to make GPU programming easier and more portable across hardware for Java programmers.
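For context, the input to such a compiler is an ordinary Java 8 parallel loop. The sketch below is a generic example of the style of kernel involved, not code from the presentation; the JIT described in the keynote would compile the lambda to GPU code and handle data transfers and memory allocation itself:

import java.util.stream.IntStream;

public class VectorAdd {
    public static void main(String[] args) {
        int n = 1 << 20;
        float[] a = new float[n], b = new float[n], c = new float[n];
        // A standard Java 8 parallel loop; a GPU-enabled JIT can compile the
        // lambda body to GPU code without explicit data-transfer calls.
        IntStream.range(0, n).parallel().forEach(i -> c[i] = a[i] + b[i]);
    }
}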
The document provides information on Caffe layers and networks for image classification tasks. It describes common layers used in convolutional neural networks (CNNs) like Convolution, Pooling, ReLU and InnerProduct. It also discusses popular CNN architectures for datasets such as MNIST, CIFAR-10 and ImageNet and the steps to prepare the data and train these networks in Caffe. Experiments comparing different CNN configurations on a 4-class image dataset show that removal of layers degrades performance, indicating their importance.
The document introduces two approaches to chemical prediction: quantum simulation based on density functional theory and machine learning based on data. It then discusses using graph-structured neural networks for chemical prediction on datasets like QM9. It presents Neural Fingerprint (NFP) and Gated Graph Neural Network (GGNN) models for predicting molecular properties from graph-structured data. Chainer Chemistry is introduced as a library for chemical and biological machine learning that implements these graph convolutional networks.
For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/ceva/embedded-vision-training/videos/pages/may-2016-embedded-vision-summit-siegel
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Yair Siegel, Director of Segment Marketing at CEVA, presents the "Fast Deployment of Low-power Deep Learning on CEVA Vision Processors" tutorial at the May 2016 Embedded Vision Summit.
Image recognition capabilities enabled by deep learning are benefiting more and more applications, including automotive safety, surveillance and drones. This is driving a shift towards running neural networks inside embedded devices. But there are numerous challenges in squeezing deep learning into resource-limited devices. This presentation details a fast path for taking a neural network from research into an embedded implementation on a CEVA vision processor core, making use of CEVA’s neural network software framework. Siegel explains how the CEVA framework integrates with existing deep learning development environments like Caffe, and how it can be used to create low-power embedded systems with neural network capabilities.
This document provides an overview and agenda for a tutorial on deep learning implementations and frameworks. The tutorial is split into two sessions: the first covers the basics of neural networks, common design aspects of neural network implementations, and differences between deep learning frameworks; the second includes coding examples in different frameworks and a conclusion. Slide decks and resources are provided for each of these topics. The tutorial aims to introduce the fundamentals of deep learning and compare popular frameworks.
1. The document discusses GPUs and their advantages for machine learning tasks like deep learning and parallel computing. GPUs have many parallel processors that can accelerate matrix multiplications and other computations used in machine learning algorithms.
2. It introduces CUDA and how it allows GPUs to be programmed for general-purpose processing through a parallel computing model. Examples are given of how matrix multiplications and convolutional neural network operations can be parallelized on GPUs (a sketch follows this list).
3. H2O is presented as a machine learning platform that supports GPU acceleration for algorithms like gradient boosted machines, enabling faster training on large datasets. Instructions are provided on getting started with CUDA, cuDNN and using GPUs for machine learning.
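As a sketch of the parallelization idea, here each output row of a matrix product is computed independently; this is written with Java parallel streams purely for illustration, since the same independence is what a CUDA kernel exploits by assigning output elements to GPU threads:

import java.util.stream.IntStream;

public class ParallelMatMul {
    // c = a * b; rows of c are independent, so they can be computed in parallel.
    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length, m = b[0].length, k = b.length;
        double[][] c = new double[n][m];
        IntStream.range(0, n).parallel().forEach(i -> {
            for (int j = 0; j < m; j++) {
                double sum = 0.0;
                for (int p = 0; p < k; p++) sum += a[i][p] * b[p][j];
                c[i][j] = sum;
            }
        });
        return c;
    }
}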
Accelerate Machine Learning Software on Intel Architecture – Intel® Software
This session presents performance data for deep learning training for image recognition, achieving a greater than 24x speedup with a single Intel® Xeon Phi™ processor 7250 compared to Caffe*. In addition, we present performance data showing that training time is reduced further, by up to a 40x speedup, with a 128-node Intel® Xeon Phi™ processor cluster over Intel® Omni-Path Architecture (Intel® OPA).
This document provides MATLAB examples of neural networks, including:
1. Calculating the output of a simple neuron and plotting it over a range of inputs (an equivalent computation is sketched after this list).
2. Creating a custom neural network, defining its topology and transfer functions, training it on sample data, and calculating outputs.
3. Classifying linearly separable data with a perceptron network and plotting the decision boundary.
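For readers without MATLAB, item 1 amounts to the following computation, sketched here in Java with made-up weights; the logistic sigmoid stands in for the transfer function:

public class SimpleNeuron {
    // Output of a single neuron: y = f(w . x + b), with a logistic sigmoid f.
    static double neuron(double[] w, double[] x, double b) {
        double z = b;
        for (int i = 0; i < w.length; i++) z += w[i] * x[i];
        return 1.0 / (1.0 + Math.exp(-z));
    }

    public static void main(String[] args) {
        double[] w = { 0.5, -1.2 };
        double[] x = { 1.0, 0.3 };
        System.out.println(neuron(w, x, 0.1)); // approx. 0.56
    }
}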
Hardware Acceleration of SVM Training for Real-time Embedded Systems: An Over... – Ilham Amezzane
Support Vector Machines (SVMs) have proven to yield high accuracy and have been widely used in recent years. However, the standard versions of the SVM algorithm are very time-consuming and computationally intensive, which challenges engineers to explore hardware architectures beyond the CPU that can perform real-time training and classification while maintaining low power consumption in embedded systems. This paper presents an overview of works based on the two most popular parallel processing devices, the GPU and the FPGA, with a focus on the multiclass training process. Since the various techniques have been evaluated on different experimental platforms and with different methodologies, we focus only on the improvements realized in each study.
A Platform for Accelerating Machine Learning Applications – NVIDIA Taiwan
Robert Sheen from HPE gave a presentation on machine learning applications and accelerating deep learning. He provided a quick introduction to neural networks, discussing their structure and how they are inspired by biological neurons. Deep learning requires high performance computing due to its computational intensity during training. Popular deep learning frameworks like CogX were also discussed, which provide tools and libraries to help build and optimize neural networks. Finally, several enterprise use cases for machine learning and deep learning were highlighted, such as in finance, healthcare, security, and geospatial applications.
Asynchronous Methods for Deep Reinforcement Learning – Ssu-Rui Lee
Slides from my first paper presentation, covering a paper from ICML 2016.
2017/05/31 (with Mion) in NCTU course: Deep Learning and Practice
2017/09/08 in NTHU course: The Cutting Edge of Deep Learning
Paper Information:
Asynchronous Methods for Deep Reinforcement Learning
Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, Koray Kavukcuoglu
https://arxiv.org/abs/1602.01783
Massively Parallel K-Nearest Neighbor Computation on Distributed Architectures – Intel® Software
This session discusses the implementation and performance of the k-nearest neighbor (KNN) computation on a distributed architecture using the Intel® Xeon Phi™ processor.
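A brute-force sketch of why KNN distributes well (generic Java, not Intel's implementation): every distance computation is independent, so the point set can be partitioned across cores or nodes, with only the final k-selection requiring a merge.

import java.util.stream.IntStream;

public class ParallelKnn {
    // Indices of the k nearest points to the query, by squared distance.
    static int[] kNearest(double[][] points, double[] query, int k) {
        double[] dist = new double[points.length];
        IntStream.range(0, points.length).parallel().forEach(i -> {
            double d = 0.0;
            for (int j = 0; j < query.length; j++) {
                double diff = points[i][j] - query[j];
                d += diff * diff;
            }
            dist[i] = d;
        });
        // Simple sort-based selection, kept simple for clarity.
        return IntStream.range(0, points.length).boxed()
                .sorted((x, y) -> Double.compare(dist[x], dist[y]))
                .limit(k)
                .mapToInt(Integer::intValue)
                .toArray();
    }
}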
This document proposes using unikernels and specialized machine learning compilers and runtimes to enable distributed machine learning on IoT devices. It demonstrates an end-to-end proof-of-concept for TinyML as a service that trains a MNIST model, compiles it to run on an ESP32 microcontroller, and performs inference on handwritten digits. Next steps include adding orchestration with CoAP, supporting more devices and complex models, distributed training on microcontrollers, and distributed inference across heterogeneous hardware accelerators.
These are slides from the Dec 17 SF Bay Area Julia Users meeting [1]. Ehsan Totoni presented the ParallelAccelerator Julia package, a compiler that performs aggressive analysis and optimization on top of the Julia compiler. Ehsan is a Research Scientist at Intel Labs working on the High Performance Scripting project.
[1] http://www.meetup.com/Bay-Area-Julia-Users/events/226531171/
Introduction to Deep Learning with Python – indico data
A presentation by Alec Radford, Head of Research at indico Data Solutions, on deep learning with Python's Theano library.
The emphasis of the presentation is high performance computing, natural language processing (using recurrent neural nets), and large scale learning with GPUs.
Video of the talk available here: https://www.youtube.com/watch?v=S75EdAcXHKk
Squeezing Deep Learning Into Mobile Phones – Anirudh Koul
A practical talk by Anirudh Koul on how to run deep neural networks on memory- and energy-constrained devices like smartphones. It highlights some frameworks and best practices.
High Performance Analytics Toolkit (HPAT) is a Julia-based framework for big data analytics on clusters that is both easy to use and extremely fast; it is orders of magnitude faster than alternatives like Apache Spark.
HPAT automatically parallelizes analytics tasks written in Julia and generates efficient MPI/C++ code.
FPT17: An object detector based on multiscale sliding window search using a f... – Hiroki Nakahara
1) The document describes an object detection system that uses a multiscale sliding window approach with fully pipelined binarized convolutional neural networks (BCNNs) implemented on an FPGA.
2) The system detects and classifies multiple objects in images by applying BCNNs to windows at different scales and locations, and suppresses overlapping detections.
3) Experimental results on a Zynq UltraScale+ MPSoC FPGA demonstrate that the proposed pipelined BCNN architecture can achieve higher accuracy than GPU-based detectors while using less than 5W of power.
Profilers find performance bottlenecks in your app but can provide confusing information. Let's give you insight into how your profiler and your app are really interacting. What profiling APIs are available, how they work, and what their implementation on the JVM (OpenJDK) side looks like (a toy sampling sketch follows the list):
Stack sampling profilers: stop motion view of your app
GetCallTrace(JVisualVM case study): The official stack sampling API
Safepoints and safepoint sampling bias
AsyncGetCallTrace(Honest Profiler Case Study): The unofficial API
JVM Profilers vs System Profilers: No API needed?
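As a toy illustration of the stack-sampling idea, the sketch below uses only the public JDK API rather than GetCallTrace or AsyncGetCallTrace. Because Thread.getAllStackTraces() snapshots threads at safepoints, it exhibits exactly the safepoint sampling bias listed above:

import java.util.Map;

public class NaiveSamplingProfiler {
    public static void main(String[] args) throws InterruptedException {
        for (int sample = 0; sample < 10; sample++) {
            // Snapshot every live thread's stack (taken at a safepoint).
            Map<Thread, StackTraceElement[]> stacks = Thread.getAllStackTraces();
            stacks.forEach((thread, frames) -> {
                if (frames.length > 0) {
                    System.out.println(thread.getName() + " @ " + frames[0]);
                }
            });
            Thread.sleep(100); // sampling interval
        }
    }
}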
HC-4012, Complex Network Clustering Using GPU-based Parallel Non-negative Mat... – AMD Developer Central
Presentation HC-4012, Complex Network Clustering Using GPU-based Parallel Non-negative Matrix Factorization, by Huming Zhu at the AMD Developer Summit (APU13) November 11-13, 2013.
Profiling PyTorch for Efficiency & Sustainability – geetachauhan
From my talk at the Data & AI Summit: the latest update on the PyTorch Profiler and how you can use it to optimize for efficiency. The talk also dives into the future and what we need to do together as an industry to move towards sustainable AI.
Evolution of Supermicro GPU Server Solution – NVIDIA Taiwan
Supermicro provides energy efficient server solutions optimized for GPU computing. Their portfolio includes 1U and 4U servers that support up to 10 GPUs, delivering the highest rack-level and node-level GPU density. Their new generation of solutions are optimized for machine learning applications using NVIDIA Pascal GPUs, with features like NVLink for high bandwidth GPU interconnect and direct low latency data access between GPUs. These solutions deliver the highest performance per watt for parallel workloads like machine learning training.
The document discusses optimizing parallelism in NumPy-based programs. It provides examples of optimizing a main function from 50.1 ms to 2.83 ms using profiling and optimization. It discusses approaches for performant numerical code including vectorization and Python compilers. It also covers issues with oversubscription when using all CPU cores and parallel APIs in NumPy, SciPy, and scikit-learn. The document provides recommendations for tuning default parallel behavior and controlling parallelism in packages.
This document proposes CNNECST, an automated framework for hardware acceleration of Convolutional Neural Networks (CNNs) on FPGAs. The framework bridges the gap between high-level ML frameworks and FPGA design. It features a modular dataflow architecture and integration with ML frameworks for specifying, training, and exporting CNN models to the FPGA. Experimental results show that CNNECST achieves significant speedups and energy efficiency gains compared to a CPU for two CNNs and datasets. Challenges include supporting more layer types and reduced precision data formats.
This document presents CNNECST, an automated framework for hardware acceleration of convolutional neural networks (CNNs) on FPGAs. The framework bridges machine learning frameworks and FPGA design through high-level APIs, an intermediate representation, and C++ libraries. It was tested on two datasets and FPGAs, achieving speedups of 45x over CPUs and higher energy efficiency, while maintaining accuracy. Challenges include supporting more layer types and reduced precision data.
This document proposes CNNECST, an automated framework for designing and implementing convolutional neural networks (CNNs) targeting FPGAs. The framework bridges the gap between high-level machine learning frameworks and FPGA design. It features high-level APIs to design CNNs, integration with ML frameworks for training, and custom libraries for a hardware dataflow architecture. Experimental results on FPGAs show CNNECST can accelerate CNNs with higher performance and lower energy compared to CPUs.
Labview1_ Computer Applications in Control_ACRRL – Mohammad Sabouri
Computer Applications in Control
ACRRL
Applied Control & Robotics Research Laboratory of Shiraz University
Department of Power and Control Engineering, Shiraz University, Fars, Iran.
Instructor: Dr. Asemani
TA: Mohammad Sabouri
https://sites.google.com/view/acrrl/
A Survey on in-a-box parallel computing and its implications on system softwa... – ChangWoo Min
1) The document surveys research on parallel computing using multicore CPUs and GPUs, and its implications for system software.
2) It discusses parallel programming models like OpenMP, Intel TBB, CUDA, and OpenCL. It also covers research on optimizing memory allocation, reducing system call overhead, and revisiting OS architecture for manycore systems.
3) The document reviews work on supporting GPUs in virtualized environments through techniques like GPU virtualization. It also summarizes projects that utilize the GPU in middleware for tasks like network packet processing.
Clipper: A Low-Latency Online Prediction Serving System – Databricks
Machine learning is being deployed in a growing number of applications which demand real-time, accurate, and robust predictions under heavy serving loads. However, most machine learning frameworks and systems only address model training and not deployment.
Clipper is a general-purpose model-serving system that addresses these challenges. Interposing between applications that consume predictions and the machine-learning models that produce predictions, Clipper simplifies the model deployment process by isolating models in their own containers and communicating with them over a lightweight RPC system. This architecture allows models to be deployed for serving in the same runtime environment as that used during training. Further, it provides simple mechanisms for scaling out models to meet increased throughput demands and performing fine-grained physical resource allocation for each model.
In this talk, I will provide an overview of the Clipper serving system and then discuss how to get started using Clipper to serve Spark and TensorFlow models in a production serving environment.
The document evaluates the performance of parallelizing a naive matrix multiplication algorithm using OpenMP on a multicore processor. It implements sequential and parallel versions of square matrix multiplication and measures execution time. Results show that as the number of threads and matrix size increase, execution time decreases and speedup and efficiency increase, demonstrating that the parallel implementation outperforms the serial version, especially for larger datasets.
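For reference, the metrics reported here follow the standard definitions (not spelled out in the document): with T_1 the single-thread execution time and T_p the time on p threads, speedup is S(p) = T_1 / T_p and efficiency is E(p) = S(p) / p; an efficient parallel implementation keeps E(p) close to 1 as p grows.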
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors – Intel® Software
The second-generation Intel® Xeon Phi™ processor offers new and enhanced features that provide significant performance gains in modernized code. For this lab, we pair these features with Intel® Software Development Products and methodologies to enable developers to gain insights on application behavior and to find opportunities to optimize parallelism, memory, and vectorization features.
An introduction to Chainer v1.11, a framework for neural networks. Slides used for the student seminar on July 20, 2016, at the Sugiyama-Sato lab at the University of Tokyo.
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ... – Chester Chen
Machine Learning at the Limit
John Canny, UC Berkeley
How fast can machine learning and graph algorithms be? In "roofline" design, every kernel is driven toward the limits imposed by CPU, memory, network, etc. This can lead to dramatic improvements: BIDMach is a toolkit for machine learning that uses rooflined design and GPUs to achieve two to three orders of magnitude improvements over other toolkits on single machines. These speedups are larger than have been reported for *cluster* systems (e.g. Spark/MLLib, Powergraph) running on hundreds of nodes, and BIDMach with a GPU outperforms these systems for most common machine learning tasks. For algorithms (e.g. graph algorithms) which do require cluster computing, we have developed a rooflined network primitive called "Kylix". We can show that Kylix approaches the roofline limits for sparse Allreduce, and empirically holds the record for distributed PageRank. Beyond rooflining, we believe there are great opportunities from deep algorithm/hardware codesign. Gibbs Sampling (GS) is a very general tool for inference, but is typically much slower than alternatives. SAME (State Augmentation for Marginal Estimation) is a variation of GS developed for marginal parameter estimation. We show that it has high parallelism and a fast GPU implementation. Using SAME, we developed a GS implementation of Latent Dirichlet Allocation whose running time is 100x faster than other samplers, and within 3x of the fastest symbolic methods. We are extending this approach to general graphical models, an area where there is currently a void of (practically) fast tools. It seems at least plausible that a general-purpose solution based on these techniques can closely approach the performance of custom algorithms.
Bio
John Canny is a professor in computer science at UC Berkeley. He is an ACM dissertation award winner and a Packard Fellow. He is currently a Data Science Senior Fellow in Berkeley's new Institute for Data Science and holds an INRIA (France) International Chair. Since 2002, he has been developing and deploying large-scale behavioral modeling systems. He designed and prototyped production systems for Overstock.com, Yahoo, Ebay, Quantcast and Microsoft. He currently works on several applications of data mining for human learning (MOOCs and early language learning), health and well-being, and applications in the sciences.
Early Benchmarking Results for Neuromorphic Computing – DESMOND YUEN
This document summarizes early benchmarking results for neuromorphic computing using Intel's Loihi chip. It finds that Loihi provides orders of magnitude gains over CPUs and GPUs for certain workloads that are directly trained on the chip or use novel bio-inspired algorithms. These include online learning, adaptive control, event-based vision and tactile sensing, constraint satisfaction problems, and nearest neighbor search. Larger networks and problems tend to provide greater performance gains with Loihi.
Easy and High Performance GPU Programming for Java Programmers – Kazuaki Ishizaki
IBM researchers presented techniques for executing Java programs on GPUs using IBM Java 8. Developers can write parallel programs using standard Java 8 stream APIs without annotations. The IBM Java runtime optimizes the programs for GPU execution by exploiting read-only caches, reducing data transfers between CPU and GPU, and eliminating redundant exception checks. Benchmark results showed the GPU version was 58.9x faster than single-threaded CPU code and 3.7x faster than 160-threaded CPU code on average, achieving good performance gains.
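A second pattern these stream APIs cover is reduction; the sketch below is a generic dot product, not code from the paper, showing the map-plus-sum form that the runtime (or a GPU-enabled JIT) can parallelize:

import java.util.stream.IntStream;

public class DotProduct {
    static double dot(double[] a, double[] b) {
        // map + sum is a reduction the runtime can execute in parallel.
        return IntStream.range(0, a.length)
                        .parallel()
                        .mapToDouble(i -> a[i] * b[i])
                        .sum();
    }

    public static void main(String[] args) {
        double[] a = { 1, 2, 3 }, b = { 4, 5, 6 };
        System.out.println(dot(a, b)); // 32.0
    }
}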
This document discusses deep learning initiatives at NECSTLab focused on hardware acceleration of convolutional neural networks using FPGAs. It proposes a framework called CNNECST that provides high-level APIs to design CNNs, integrates with machine learning frameworks for training, and generates customized hardware for FPGA implementation through C++ libraries and Vivado. Experimental results show speedups and energy savings for CNNs such as LeNet on datasets such as MNIST, on FPGA boards compared to a CPU. Challenges and future work include supporting more layer types and reduced-precision computations.
GPUIterator: Bridging the Gap between Chapel and GPU Platforms – Akihiro Hayashi
The ACM SIGPLAN 6th Annual Chapel Implementers and Users Workshop (CHIUW2019) co-located with PLDI 2019 / ACM FCRC 2019.
PGAS (Partitioned Global Address Space) programming models were originally designed to facilitate productive parallel programming at both the intra-node and inter-node levels in homogeneous parallel machines. However, there is a growing need to support accelerators, especially GPU accelerators, in heterogeneous nodes in a cluster. Among high-level PGAS programming languages, Chapel is well suited for this task due to its use of locales and domains to help abstract away low-level details of data and compute mappings for different compute nodes, as well as for different processing units (CPU vs. GPU) within a node. In this paper, we address some of the key limitations of past approaches on mapping Chapel onto GPUs as follows. First, we introduce a Chapel module, GPUIterator, which is a portable programming interface that supports GPU execution of a Chapel forall loop. This module makes it possible for Chapel programmers to easily use hand-tuned native GPU programs/libraries, which is an important requirement in practice since there is still a big performance gap between compiler-generated GPU code and hand-tuned GPU code; hand-optimization of CPU-GPU data transfers is also an important contributor to this performance gap. Second, though Chapel programs are regularly executed on multi-node clusters, past work on GPU enablement of Chapel programs mainly focused on single-node execution. In contrast, our work supports execution across multiple CPU+GPU nodes by accepting Chapel's distributed domains. Third, our approach supports hybrid execution of a Chapel parallel (forall) loop across both a GPU and CPU cores, which is beneficial for specific platforms. Our preliminary performance evaluations show that the use of the GPUIterator is a promising approach for Chapel programmers to easily utilize a single or multiple CPU+GPU node(s) while maintaining portability.
Monte Carlo simulation is well-suited for GPU acceleration due to its highly parallel nature. GPUs provide lower cost and higher performance than CPUs for Monte Carlo applications. Numerical libraries for GPUs allow developers to focus on their models rather than reimplementing basic components. NAG has developed GPU libraries including random number generators and is working with financial institutions to apply Monte Carlo simulations to problems in finance.
Presentation Thesis - Convolutional net on the Xeon Phi using SIMD - Gaurav R... – Gaurav Raina
This document summarizes work optimizing a deep convolutional network for the Intel Xeon Phi coprocessor. The optimizations included loop unrolling, vectorization using SIMD intrinsics, and parallelization with OpenMP. Testing on an Intel Core i7 showed up to 6.3x speedup from vectorization. Mapping to the Xeon Phi with its 512-bit SIMD units yielded up to 11x speedup over an unoptimized version. Roofline models showed performance was bounded by memory bandwidth. Overall, the work contributed optimized code for convolutional networks running up to 43 frames per second on a Xeon Phi.
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni... – MLconf
Fast, Cheap and Deep – Scaling Machine Learning: Distributed high throughput machine learning is both a challenge and a key enabling technology. Using a Parameter Server template we are able to distribute algorithms efficiently over multiple GPUs and in the cloud. This allows us to design very fast recommender systems, factorization machines, classifiers, and deep networks. This degree of scalability allows us to tackle computationally expensive problems efficiently, yielding excellent results e.g. in visual question answering.
Best Practices for performance evaluation and diagnosis of Java Applications ... – IndicThreads
Session Presented at 5th IndicThreads.com Conference On Java held on 10-11 December 2010 in Pune, India
WEB: http://J10.IndicThreads.com
------------
Enterprise applications typically comprise multi-layered stacks including the application modules, application servers, the Java Virtual Machine and the underlying operating system. Consequently, the performance of these applications is a function of these different layers. When a performance problem occurs, it is often difficult to determine the starting point for diagnosis. The Java Virtual Machine is the ‘engine’ for most of these applications; it is broadly responsible for efficient execution and memory management. End users have difficulty attributing performance effects to the JVM because the JVM is usually viewed as a ‘black box’.
This talk provides insight into the key subsystems of the JVM by looking under the hood of a high-performance JVM. It goes on to discuss approaches and techniques for analyzing performance issues. It concludes by introducing the audience to a tool called the “Health Center”, which is useful for evaluating and understanding the JVM behavior of a running application in an unobtrusive, lightweight manner.
Takeaways for the audience: a better understanding of key JVM components, approaches and techniques for diagnosing performance issues, and performance evaluation using the Health Center.
Similar to Exploration of Supervised Machine Learning Techniques for Runtime Selection of CPU vs. GPU Execution in Java Programs (20)
Chapel-on-X: Exploring Tasking Runtimes for PGAS Languages – Akihiro Hayashi
With the shift to exascale computer systems, the importance of productive programming models for distributed systems is increasing. Partitioned Global Address Space (PGAS) programming models aim to reduce the complexity of writing distributed-memory parallel programs by introducing global operations on distributed arrays, distributed task parallelism, directed synchronization, and mutual exclusion. However, a key challenge in the application of PGAS programming models is the improvement of compilers and runtime systems. In particular, one open question is how runtime systems meet the requirement of exascale systems, where a large number of asynchronous tasks are executed.
While there are various tasking runtimes such as Qthreads, OCR, and HClib, there is no existing comparative study on PGAS tasking/threading runtime systems. To explore runtime systems for PGAS programming languages, we have implemented OCR-based and HClib-based Chapel runtimes and evaluated them with an initial focus on tasking and synchronization implementations. The results show that our OCR- and HClib-based implementations can improve the performance of PGAS programs compared to the existing Qthreads backend of Chapel.
Polyhedral compilation uses the polyhedral model to represent programs as systems of affine inequalities over iteration variables. This allows loop transformations like fusion, distribution, skewing and reversal to be expressed as affine mappings on the iteration space. The key aspects are representing the iteration domain, scheduling functions that determine the execution order of statements, and memory accesses in terms of iteration vectors. Loop transformations are specified by changing the scheduling functions to map iterations to new logical execution times while preserving semantics. This enables optimizations at the level of whole programs or subprograms.
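A small worked example in standard polyhedral notation (not drawn from the document): loop interchange on a two-deep nest is expressed purely as a change of scheduling function.

% Nest: for i in 0..N-1 { for j in 0..M-1 { S(i,j); } }
% Iteration domain as a system of affine inequalities:
D_S = \{\, (i,j) \in \mathbb{Z}^2 \mid 0 \le i \le N-1,\; 0 \le j \le M-1 \,\}
% Original (identity) schedule and the interchanged schedule:
\theta_S(i,j) = (i,j), \qquad \theta'_S(i,j) = (j,i)
% Legality: every dependence from source (i,j) to sink (i',j') must still
% execute in order under the new schedule:
\theta'_S(i,j) \prec_{\mathrm{lex}} \theta'_S(i',j')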
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator... – Akihiro Hayashi
Third Workshop on Accelerator Programming Using Directives (WACCPD2016, co-located with SC16)
While GPUs are increasingly popular for high-performance computing, optimizing the performance of GPU programs is a time-consuming and non-trivial process in general. This complexity stems from the low abstraction level of standard GPU programming models such as CUDA and OpenCL: programmers are required to orchestrate low-level operations in order to exploit the full capability of GPUs. In terms of software productivity and portability, a more attractive approach would be to facilitate GPU programming by providing high-level abstractions for expressing parallel algorithms.
OpenMP is a directive-based shared-memory parallel programming model and has been widely used for many years. From OpenMP 4.0 onwards, GPU platforms are supported by extending OpenMP's high-level parallel abstractions with accelerator programming. This extension allows programmers to write GPU programs in standard C/C++ or Fortran, without exposing too many details of GPU architectures. However, such high-level parallel programming strategies generally impose additional program optimizations on compilers, which could result in lower performance than fully hand-tuned code with low-level programming models. To study potential performance improvements from compiling and optimizing high-level GPU programs, in this paper we 1) evaluate a set of OpenMP 4.x benchmarks on an IBM POWER8 and NVIDIA Tesla GPU platform and 2) conduct a comparative performance analysis among hand-written CUDA programs and GPU programs automatically generated by the IBM XL and clang/LLVM compilers.
LLVM-based Communication Optimizations for PGAS Programs – Akihiro Hayashi
The Second Workshop on the LLVM Compiler Infrastructure in HPC (Co-located with SC15)
While Partitioned Global Address Space (PGAS) programming languages such as UPC/UPC++, CAF, Chapel and X10 provide high-level programming models for facilitating large-scale distributed-memory parallel programming, it is widely recognized that compiler analysis and optimization for these languages has been very limited, unlike the optimization of SMP models such as OpenMP. One reason for this limitation is that current optimizers for PGAS programs are specialized to different languages. This is unfortunate since communication optimization is an important class of compiler optimizations for PGAS programs running on distributed-memory platforms, and these optimizations need to be performed more widely. Thus, a more effective approach would be to build a language-independent and runtime-independent compiler framework for optimizing PGAS programs so that new communication optimizations can be leveraged by different languages. To address this need, we introduce an LLVM-based (Low Level Virtual Machine) communication optimization framework. Our compilation system leverages existing optimization passes and introduces new PGAS language-aware runtime-dependent/independent passes to reduce communication overheads. Our experimental results show an average performance improvement of 3.5× and 3.4× on 64 nodes of a Cray XC30™ supercomputer and 32 nodes of a Westmere cluster respectively, for a set of benchmarks written in the Chapel language. Overall, we show that our new LLVM-based compiler optimization framework can effectively improve the performance of PGAS programs.
Studies on Automatic Parallelization for Heterogeneous and Homogeneous Multi... – Akihiro Hayashi
This document discusses research on automatic parallelization for heterogeneous and homogeneous multicore processors, presented as Akihiro Hayashi's PhD defense at Waseda University. It motivates automatic parallelization by the difficulty of programming multicore processors and proposes a solution, the OSCAR heterogeneous multicore compiler with its associated APIs, to enable automatic parallelization across different processor types. The methodology involves hint directives, parallelization of tasks, power reduction techniques, and generation of executables. The approach is evaluated on media applications using a Renesas multicore processor.
LLVM Optimizations for PGAS Programs -Case Study: LLVM Wide Optimization in C... – Akihiro Hayashi
This document summarizes research on LLVM optimizations for PGAS (Partitioned Global Address Space) languages like Chapel. It discusses generating LLVM IR from Chapel to enable optimizations like LICM (Loop-Invariant Code Motion). Evaluations show LLVM optimizations remove many communication operations and improve performance for some applications versus C code generation. However, LLVM constraints and wide-pointer overhead hurt performance for other applications. Future work includes more applications, possibly-remote to definitely-local transformations, and parallel intermediate representations in LLVM.
Speculative Execution of Parallel Programs with Precise Exception Semantics ... – Akihiro Hayashi
Akihiro Hayashi, Max Grossman, Jisheng Zhao, Jun Shirako, Vivek Sarkar. The 26th International Workshop on Languages and Compilers for Parallel Computing (LCPC2013), September 25-27, 2013 Qualcomm Research Silicon Valley, Santa Clara, CA (co-located with CnC-2013).
Accelerating Habanero-Java Program with OpenCL Generation – Akihiro Hayashi
Accelerating Habanero-Java Program with OpenCL Generation. Akihiro Hayashi, Max Grossman, Jisheng Zhao, Jun Shirako, Vivek Sarkar. 10th International Conference on the Principles and Practice of Programming in Java (PPPJ), September 2013.
Exploration of Supervised Machine Learning Techniques for Runtime Selection of CPU vs. GPU Execution in Java Programs
1. Exploration of Supervised Machine Learning Techniques for Runtime Selection of CPU vs. GPU Execution in Java Programs
Gloria Kim (Rice University)
Akihiro Hayashi (Rice University)
Vivek Sarkar (Georgia Tech)
2. Java for HPC?
Platform: dual-socket Intel Xeon E5-2666 v3, 18 cores, 2.9 GHz, 60 GiB DRAM

4K x 4K Matrix Multiply        GFLOPS    Absolute speedup
Python                           0.005           1
Java                             0.058          11
C                                0.253          47
Parallel loops                   1.969         366
Parallel divide and conquer     36.168       6,724
+ vectorization                124.945      23,230
+ AVX intrinsics               335.217      62,323
Strassen                       361.681      67,243

Credits: Charles Leiserson, Ken Kennedy Award Lecture @ Rice University; Karthik Murthy @ PACT2015
Are there any ways to accelerate Java programs?
4. Explicit Parallelism with Java
High-level parallel programming with Java offers opportunities for:
- preserving portability (low-level parallel/accelerator programming is not required)
- enabling the compiler to perform parallel-aware optimizations and code generation
[Diagram: Java 8 programs written with the Java 8 Parallel Stream API (SW) can target multi-core CPUs, many-core GPUs, and FPGAs (HW).]
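To make this concrete, here is a minimal sketch (my own illustration, not from the slides) of the kind of kernel the deck targets: a vector add written with the Java 8 Parallel Stream API. The size n and the loop body are placeholders.

    import java.util.stream.IntStream;

    public class VecAddExample {
        public static void main(String[] args) {
            final int n = 4_000_000;
            double[] a = new double[n];
            double[] b = new double[n];
            double[] c = new double[n];
            // The runtime (or a GPU-enabled JIT like the one described in
            // this deck) is free to run these independent iterations in parallel.
            IntStream.range(0, n).parallel()
                     .forEach(i -> c[i] = a[i] + b[i]);
        }
    }

This IntStream.range(...).parallel().forEach(...) shape is exactly the pattern the next slide uses as the compilation target.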
5. Challenges for GPU Code Generation
Standard Java API call for parallelism:
IntStream.range(0, N).parallel().forEach(i -> <lambda>);
- Challenge 1: supporting Java features on GPUs (exception semantics, virtual method calls)
- Challenge 2: accelerating Java programs on GPUs (kernel optimizations, data transfer (DT) optimizations)
- Challenge 3: CPU vs. GPU selection (selecting the faster device from CPUs and GPUs)
Credits: Checklist by Phil Laver, Rocket by Luis Prado, Rocket by Maxwell Painter Karpylev, Running by Dillon Arloff; all from the Noun Project
6. Related Work: Java + GPU

Work           Lang    JIT   GPU Kernel             Device Selection
JCUDA          Java    -     CUDA                   GPU only
Lime           Lime    Y     Override map/reduce    Static
Firepile       Scala   Y     reduce                 Static
JaBEE          Java    Y     Override run           GPU only
Aparapi        Java    Y     map                    Static
Hadoop-CL      Java    Y     Override map/reduce    Static
RootBeer       Java    Y     Override run           Not described
HJ-OpenCL      HJ      -     forall / lambda        Static
PPPJ09 (auto)  Java    Y     For-loop               Dynamic with regression
Our work       Java    Y     Parallel Stream        Dynamic with machine learning

None of these approaches considers Java 8 Parallel Stream APIs together with dynamic device selection based on machine learning.
7. JIT Compilation for GPU
IBM Java 8 Compiler: built on top of the production version of the IBM Java 8 runtime environment (J9 VM).
[Diagram: a method is interpreted on the JVM for its first N invocations; at the (N+1)th invocation, native code is generated for multi-core CPUs and many-core GPUs.]
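As a rough sketch of the tiered execution the diagram describes (my own illustration of the general policy, not J9 internals; the threshold value and all names are invented):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical JIT trigger: interpret a method until its invocation
    // count crosses a threshold, then switch to compiled native code.
    class TieredDispatch {
        private static final int JIT_THRESHOLD = 1000; // illustrative value
        private final Map<String, Integer> counts = new ConcurrentHashMap<>();

        void invoke(String methodName, Runnable interpreted, Runnable compiled) {
            int n = counts.merge(methodName, 1, Integer::sum);
            if (n <= JIT_THRESHOLD) {
                interpreted.run();   // interpretation on the JVM
            } else {
                compiled.run();      // generated native code for CPUs/GPUs
            }
        }
    }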
8. The JIT Compilation Flow
Java bytecode -> translation to our IR -> existing optimizations -> parallel streams identification -> optimizations for GPUs -> NVVM IR -> libNVVM (NVIDIA's tool chain) -> GPU binary.
The CPU path continues from our IR through target machine code generation to a CPU binary; GPU binaries are invoked through the GPU runtime. All stages up to NVVM IR generation are performed by our JIT compiler.
11. Our Approach: ML-based Performance Heuristics
A binary prediction model is constructed by supervised machine learning techniques.
[Diagram: during training runs with the JIT compiler, features (feature 1, feature 2, feature 3, ...) are extracted from the bytecode of each application/data-set pair (App A with data 1, App A with data 2, App B with data 3, ...); offline, an ML algorithm turns the collected feature vectors into a CPU/GPU prediction model that the Java runtime then consults.]
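On the runtime side, consulting such a binary model amounts to one cheap function call per kernel launch. Here is a hedged sketch with a linear classifier standing in for whatever model was trained offline; the weights, bias, and feature layout are placeholders of my own:

    // Hypothetical runtime device selection with a trained linear classifier.
    // Weights and bias would come from offline training; these are made up.
    class DevicePredictor {
        private final double[] weights;
        private final double bias;

        DevicePredictor(double[] weights, double bias) {
            this.weights = weights;
            this.bias = bias;
        }

        /** Returns true if the GPU is predicted to be faster. */
        boolean preferGpu(double[] features) {
            double score = bias;
            for (int i = 0; i < weights.length; i++) {
                score += weights[i] * features[i];
            }
            return score > 0.0;
        }
    }

A GPU-enabled runtime would call preferGpu(...) once per kernel launch and dispatch to the GPU binary or the multi-threaded CPU path accordingly.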
12. Features of a Program
- Loop range (parallel loop size)
- Dynamic numbers of instructions in the IR, broken down into:
  - memory accesses
  - arithmetic operations
  - math methods
  - branch instructions
  - other types of instructions
(One possible record layout for these features is sketched below.)
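As an illustration only (the field names are mine, guided by the list above; the speaker notes mention that feature vectors are expressed in JSON), one training sample might be carried in a record like this:

    // Hypothetical container for the per-kernel features listed above.
    // The field names are illustrative, not the paper's actual schema.
    class KernelFeatures {
        long range;            // parallel loop size
        long memoryAccesses;   // dynamic counts in the IR
        long arithmeticOps;
        long mathMethods;
        long branches;
        long otherInsns;

        double[] toVector() {
            return new double[] {
                range, memoryAccesses, arithmeticOps,
                mathMethods, branches, otherInsns
            };
        }
    }

The toVector() view is what a classifier like the ones evaluated later would consume.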
15. Offline Prediction Model Construction
- Obtained 291 samples by running 11 applications with different data sets
- Obtained 41 additional samples by running 3 additional applications
- The choice is binary: either the GPU or 160 worker threads on the CPU
[Diagram: the same training pipeline as slide 11; features extracted during training runs with the JIT compiler feed the ML algorithms that construct the prediction model offline.]
16. Platform
CPU: IBM POWER8 @ 3.69 GHz, 20 cores, 8 SMT threads per core (up to 160 threads), 256 GB of RAM
GPU: NVIDIA Tesla K40m, 12 GB of global memory
17. Applications

Application             Source     Field                  Max Size          Data Type
BlackScholes            -          Finance                4,194,304         double
Crypt                   JGF        Cryptography           Size C (N=50M)    byte
SpMM                    JGF        Numerical Computing    Size C (N=500K)   double
MRIQ                    Parboil    Medical                Large (64^3)      float
Gemm                    Polybench  Numerical Computing    2K x 2K           int
Gesummv                 Polybench  Numerical Computing    2K x 2K           int
Doitgen                 Polybench  Numerical Computing    256x256x256       int
Jacobi-1D               Polybench  Numerical Computing    N=4M, T=1         int
Matrix Multiplication   -          -                      2K x 2K           double
Matrix Transpose        -          -                      2K x 2K           double
VecAdd                  -          -                      4M                double
19. Top 3 ML Algorithms, Full Set of Features (10 Features)

[Bar chart; accuracy in %, higher is better]
Algorithm             5-fold CV   Original training data   Unknown testing data
LIBSVM                98.28       99.66                    98.28
J48 Tree              98.28       98.97                    98.68
Logistic Regression   97.60       98.97                    92.68
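These figures include 5-fold cross-validation (CV) accuracies. As a reminder of what 5-fold CV computes, here is a hedged sketch in Java (my own illustration; the Sample and Model types and the train callback are hypothetical, and it uses Java 16+ records for brevity):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.function.Function;

    class CrossValidation {
        interface Model { boolean predict(double[] features); }
        record Sample(double[] features, boolean gpuFaster) {}

        // Split samples into 5 folds; train on 4 folds, test on the held-out
        // fold, and report the mean accuracy over the 5 held-out folds.
        static double fiveFoldAccuracy(List<Sample> samples,
                                       Function<List<Sample>, Model> train) {
            List<Sample> shuffled = new ArrayList<>(samples);
            Collections.shuffle(shuffled);
            final int k = 5;
            double total = 0.0;
            for (int fold = 0; fold < k; fold++) {
                List<Sample> test = new ArrayList<>();
                List<Sample> trainSet = new ArrayList<>();
                for (int i = 0; i < shuffled.size(); i++) {
                    (i % k == fold ? test : trainSet).add(shuffled.get(i));
                }
                Model m = train.apply(trainSet);
                long correct = test.stream()
                        .filter(s -> m.predict(s.features()) == s.gpuFaster())
                        .count();
                total += (double) correct / test.size();
            }
            return total / k;
        }
    }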
20. Further Explorations and Analyses by Feature Subsetting
Research questions:
- Are the 10 features the best combination, in terms of both accuracy and runtime prediction overhead?
- Which features are more important?
21. Feature Subsetting for Further Analyses: 10 Algorithms x 1024 Subsets
The 10 features:
1. Loop range (parallel loop size)
2. # of memory accesses
3. # of arithmetic operations
4. # of math methods
5. # of branch instructions
6. # of other types of instructions
7. # of coalesced memory accesses
8. # of offset memory accesses
9. # of stride memory accesses
10. # of other types of array accesses
[Diagram: each feature subset (Subset 1 through Subset 1024) is paired with each ML algorithm (SVM, decision trees, logistic regression, multi-layer perceptron, kNN, decision stumps, naive Bayes, ...), and the resulting accuracy is recorded for every (algorithm, subset) pair; a sketch of the subset enumeration follows.]
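Since 2^10 = 1,024, the subsets correspond exactly to the 10-bit masks. A hedged sketch of the enumeration (my own code; the accuracyOf callback, which stands in for training and cross-validating one model per subset, is hypothetical):

    // Enumerate subsets of the 10 features via a 10-bit mask.
    class SubsetSearch {
        static final int NUM_FEATURES = 10;

        // Project a full 10-element feature vector onto the subset
        // selected by the given bitmask.
        static double[] select(double[] fullVector, int mask) {
            double[] out = new double[Integer.bitCount(mask)];
            for (int i = 0, j = 0; i < NUM_FEATURES; i++) {
                if ((mask & (1 << i)) != 0) out[j++] = fullVector[i];
            }
            return out;
        }

        // Try every non-empty subset and keep the most accurate one.
        static int bestSubset(java.util.function.IntToDoubleFunction accuracyOf) {
            int best = 0;
            double bestAcc = -1.0;
            for (int mask = 1; mask < (1 << NUM_FEATURES); mask++) {
                double acc = accuracyOf.applyAsDouble(mask); // e.g., 5-fold CV accuracy
                if (acc > bestAcc) { bestAcc = acc; best = mask; }
            }
            return best;
        }
    }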
22. The Impact of Subsetting (Full Set vs. the Best Subset)

[Bar chart; accuracy in %, higher is better]
Algorithm             Best subset of features   Full set of features
LIBSVM                99.656                    98.282
Logistic Regression   98.625                    97.595
J48 Tree              98.282                    98.282

Feature subsetting can yield better accuracies.
23. The Impact of Subsetting (Full Set vs. the Best Subset)

[Bar chart; accuracy in %, higher is better]
Algorithm (best subset size)       5-fold CV   Original training data   Unknown testing data
LIBSVM (3 features)                99.66       99.66                    99.66
Logistic Regression (4 features)   98.63       98.97                    82.93
J48 Tree (2 features)              98.28       98.28                    92.68

Subsets with only a few features show great accuracies.
24. An Example of Subsetting Analyses
With LIBSVM, 99 subsets achieved the highest accuracy.

Feature      # of models with this feature   % of models with this feature
range        99                              100.0%
stride       96                              97.0%
arithmetic   65                              65.7%
Other 2      56                              56.6%
memory       56                              56.6%
offset       55                              55.6%
branch       54                              54.5%
math         46                              46.5%
Other 1      43                              43.4%
25. Runtime Prediction Overheads

Feature set              LIBSVM         Logistic Regression   J48 Decision Tree
Full set (10 features)   2.278 usec     0.158 usec            0.020 usec
Best subset              2.107 usec     0.106 usec            0.020 usec
                         (3 features)   (4 features)          (2 features)

Subsetting can reduce prediction overheads. However, this overhead is very small compared to the kernel execution time: based on our analysis, kernels usually take a few milliseconds.
26. Lessons Learned
- ML-based CPU/GPU selection is a promising way to choose the faster device.
- LIBSVM, logistic regression, and the J48 decision tree are the machine learning techniques that produce the most accurate models.
- Loop range, coalesced accesses, other types of array accesses, and arithmetic instructions are particularly important features.
27. Lessons Learned (Cont'd)
- While LIBSVM (non-linear classification) shows excellent prediction accuracy, its runtime prediction overhead is relatively large compared to the other algorithms. However, these overheads are negligible in general.
- The J48 decision tree shows accuracy comparable to LIBSVM. Also, the output of the J48 decision tree is more human-readable and fine-tunable (see the sketch below).
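To illustrate why a J48-style tree is cheap to evaluate and easy to inspect or hand-tune, here is a hedged sketch of a two-feature decision tree transcribed into plain Java; the split features and thresholds are invented for illustration, not taken from the trained model:

    // Hypothetical hand-transcribed decision tree over two features.
    // Thresholds are illustrative; a real J48 model would supply them.
    class TreePredictor {
        /** Returns true if the GPU is predicted to be faster. */
        static boolean preferGpu(long range, long coalescedAccesses) {
            if (range < 100_000) {
                return false;             // small parallel loops: stay on the CPU
            }
            return coalescedAccesses > 0; // large loops with coalesced accesses: GPU
        }
    }

A couple of integer comparisons per prediction is consistent with the 0.020 usec overhead reported on slide 25, and every branch is readable and editable.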
28. Conclusions
Conclusions:
- ML-based CPU/GPU selection is a promising way to choose the faster device.
Future work:
- Evaluate end-to-end performance improvements.
- Explore the possibility of applying our technique to OpenMP, OpenACC, and OpenCL.
- Perform experiments on recent versions of GPUs.
29. Further Reading
Code generation and optimizations:
- "Compiling and Optimizing Java 8 Programs for GPU Execution." Kazuaki Ishizaki, Akihiro Hayashi, Gita Koblents, Vivek Sarkar. 24th International Conference on Parallel Architectures and Compilation Techniques (PACT), October 2015.
Performance heuristics (SVM-based):
- "Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection." Akihiro Hayashi, Kazuaki Ishizaki, Gita Koblents, Vivek Sarkar. 12th International Conference on the Principles and Practice of Programming on the Java Platform: Virtual Machines, Languages, and Tools (PPPJ), September 2015.
Editor's Notes
Thanks for the introduction. Hi everyone, my name is Akihiro Hayashi. I'm a research scientist at Rice University. Today I'll be talking about GPU code generation from Java programs and machine-learning-based performance heuristics for runtime CPU/GPU selection. This work was done in collaboration with Kazuaki Ishizaki at IBM Research Tokyo, Gita Koblents at IBM Canada, and Vivek Sarkar at Rice.
Okay, let me first talk about the motivation. The Java language has long been regarded as a highly productive but not a high-performance language. What you see here is the performance of 4K x 4K matrix multiplication in different languages. Java is 11x faster than the Python implementation, but it is still very far from the state-of-the-art library implementations. One open question is whether there are ways to accelerate Java programs while maintaining their high-productivity features.
The good news is that the latest version of Java introduces the Parallel Stream API, which offers more opportunities for parallel programming than previous editions supported.
These APIs allow programmers to write explicitly parallel programs with lambda expressions.
Here is an example of the Java 8 Parallel Stream API. You first generate the range of a parallel loop with the range method.
In this case you generate a stream that ranges from 0 to N-1, then specify that this computation can be done in parallel, and finally supply the actual computation as a lambda expression.
Such high-level parallel programming with Java offers opportunities for preserving portability and for enabling the compiler to perform parallel-aware optimizations and code generation. Since we now have uniform, high-level expressions of parallelism, compilers can generate parallel code for various kinds of hardware, including multi-core CPUs, many-core GPUs, and FPGAs.
Now that we have a standard Java API for explicit parallelism, we believe that generating GPU code is one of the most promising ways to accelerate Java programs. However, there are three challenges that we must cope with.
Challenge 1 is how to support Java features such as exception semantics and virtual method calls on GPUs.
Challenge 2 is how to compile and optimize Java programs for GPU execution.
Challenge 3 is performance heuristics. Since GPUs are not always faster than CPUs, we should properly select the faster of the two devices. If you are going to a grocery store, you would not take a rocket, right? We definitely need criteria for choosing the faster device.
Okay, we have been working on the IBM Java 8 compiler, which can generate GPU code from Java 8 programs. As in conventional JIT compilers, a method is first interpreted on the JVM and is then JIT-compiled once its invocation count exceeds a certain threshold. We have implemented a native code generator for many-core GPUs in addition to multi-core CPUs.
Here is the compilation flow in detail. Our JIT compiler first takes Java bytecode and generates our intermediate representation (IR). The compiler applies several existing optimizations such as PRE. When the compiler identifies parallel streams in the IR, it applies several kinds of GPU-aware optimizations. After that, the compiler generates NVVM IR, an extended version of LLVM IR, which is finally translated into a GPU binary.
Let's talk about the motivation for device selection. We can now generate parallel code for both CPUs and GPUs, but one challenging problem is how to select the faster device for a given parallel loop, since the answer depends on the characteristics of the program and its data sets.
Here is our approach. We build a prediction model that chooses the faster device between the GPU and the CPU using machine learning. To do so, we run various applications with different data sets for training. The JIT compiler extracts features of programs that may affect performance during JIT compilation, and the resulting training data is eventually used to construct a binary prediction model.
Here we have the list of features we are using so far. As I mentioned earlier, the parallel loop size and the number of instructions have a strong relationship with the execution time, which is why we chose these features. For the instruction counts, we distinguish several types of instructions so that we can characterize the behavior of applications, for example, whether an application is memory-bound or computation-bound.
Additionally, the number of array accesses is an important factor affecting GPU performance. It is well known that coalesced (aligned) access achieves better memory bandwidth than misaligned access. Data transfer size is also an important factor affecting GPU performance.
Here is an example of a feature vector, expressed in JSON format.