This talk starts with an overview of the ailia SDK, then introduces optimization techniques for running neural network inference at high speed in Arm environments. Based on our research in developing the ailia SDK, we introduce optimizations for Arm CPUs using NEON SIMD instructions and various optimized compute shader implementations for Arm Mali GPUs using Vulkan. In addition, we demonstrate how various machine learning models actually run at high speed in Arm environments.
The document discusses deep learning applications design, development and deployment in IoT edge. It describes using a Power9 system to train artificial neural network models using the MNIST dataset. It also covers building inference engines for Android phones and deploying visual recognition models to IBM Watson Studio.
Bolt C++ Standard Template Library for HSA by Ben Sanders, AMD (HSA Foundation)
The document introduces Bolt, a C++ template library for heterogeneous system architecture (HSA) that aims to improve developer productivity for GPU programming. Bolt provides optimized library routines for common GPU operations using open standards like OpenCL and C++ AMP. It resembles the familiar C++ Standard Template Library. Bolt allows programming GPUs as easily as CPUs, handles workload distribution across devices, and provides a single source code base for both CPU and GPU. Examples show how Bolt can be used with C++ AMP and OpenCL, including passing user-defined functors. An exemplary video enhancement application demonstrates Bolt's use in a commercial product.
This presentation accompanies the webinar replay located here: http://bit.ly/1zmvlkL
AMD Media SDK Software Architect Mikhail Mironov shows you how to leverage an AMD platform for multimedia processing using the new Media Software Development Kit. He discusses how to use a new set of C++ interfaces for easy access to AMD hardware blocks, and shows you how to leverage the Media SDK in the development of video conferencing, wireless display, remote desktop, video editing, transcoding, and more.
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar (AMD Developer Central)
This deck presents highlights from the Introduction to OpenCL™ Programming Webinar presented by Acceleware & AMD on Sept. 17, 2014. Watch a replay of this popular webinar on the AMD Dev Central YouTube channel here: https://www.youtube.com/user/AMDDevCentral or here for the direct link: http://bit.ly/1r3DgfF
These are the slides to my talk "A Deep Dive into QtCanBus" from Qt World Summit 2019. I show how to build the middleware between the CAN bus and the QML HMI.
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ... (AMD Developer Central)
Presentation HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated Processing Units, by Robert Engel at the AMD Developer Summit (APU13) Nov. 11-13, 2013.
PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by W... (AMD Developer Central)
Presentation PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by Wu Feng and Mark Gardner at the AMD Developer Summit (APU13) November 11-13, 2013.
This document discusses AMD's DirectGMA technology, which allows direct access to GPU memory from other devices. It introduces DirectGMA and explains how it enables peer-to-peer transfers between GPUs, and between GPUs and FPGAs. It then details how to implement DirectGMA in APIs such as OpenGL, OpenCL, and DirectX 9, 10, and 11 to enable efficient data transfers without CPU involvement.
Webinar: Building Embedded Applications from QtCreator with Docker (Burkhard Stubert)
This document discusses using Docker containers to build embedded apps from QtCreator more efficiently. It presents three solutions: 1) using virtual machines, which is the current but slow approach; 2) building apps in Docker containers on the local workstation, which is faster; and 3) building apps in containers on remote workstations. The document then provides steps for setting up a development environment using Docker to build an app in QtCreator, including installing a Qt SDK in a container, configuring QtCreator, building the app using a Docker wrapper for CMake, and running the app on a Raspberry Pi device.
The document provides information about Intel's Perceptual Computing SDK and related hardware and software. It discusses the Creative Interactive Gesture Camera that can be used with the SDK for close-range interactivity. It also outlines key upcoming products, an overview of perceptual computing capabilities like facial tracking and gesture recognition, hardware and software requirements, supported programming languages and frameworks, and resources for using the SDK.
Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by... (AMD Developer Central)
Keynote presentation, Is There Anything New in Heterogeneous Computing, by Mike Muller, Chief Technology Officer, ARM, at the AMD Developer Summit (APU13), Nov. 11-13, 2013.
The document discusses optimizing DirectCompute for graphics processing units (GPUs). It introduces DirectCompute as an API for general purpose GPU programming and explains how it maps better than traditional graphics APIs to the architecture of GPUs like AMD's GCN. Key optimization techniques covered include choosing a thread group size that is a multiple of 64 threads to fully utilize the SIMD hardware, and using thread group shared memory to improve data access speeds compared to off-chip memory. Bank conflicts in shared memory access are also discussed as something to avoid for best performance.
WT-4071, GPU accelerated 3D graphics for Java, by Kevin Rushforth, Chien Yang... (AMD Developer Central)
Presentation WT-4071, GPU accelerated 3D graphics for Java, by Kevin Rushforth, Chien Yang, John Yoon and Nicolas Lorain at the AMD Developer Summit (APU13) Nov. 11-13, 2013.
Hexagonal Architecture: The Standard for Qt Embedded Applications (Burkhard Stubert)
This document discusses the hexagonal architecture pattern for Qt embedded HMIs. It begins with an introduction to ports-and-adapters architecture, then discusses using this pattern to create machine, business logic, and GUI components with different adapters for products, simulators, and tests. It explains how to create these components, connect them, and configure the overall application to be highly modular, testable and maintainable. Benefits of this hexagonal architecture include high testability, modularity, and maintainability, though it does add some complexity. Resources for further information are provided.
PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern ... (AMD Developer Central)
This document discusses the simulation, compilation, and debugging of OpenCL on the AMD Southern Islands GPU architecture using the Multi2Sim simulator. It describes Multi2Sim's methodology for full-system simulation and emulation of CPU and GPU architectures. It provides details on Multi2Sim's x86 emulator, OpenCL runtime on the host, and emulation of the Southern Islands GPU including its disassembler, emulator, and timing simulator.
The document provides an overview of introductory GPGPU programming with CUDA. It discusses why GPUs are useful for parallel computing applications due to their high FLOPS and memory bandwidth capabilities. It then outlines the CUDA programming model, including launching kernels on the GPU with grids and blocks of threads, and memory management between CPU and GPU. As an example, it walks through a simple matrix multiplication problem implemented on the CPU and GPU to illustrate CUDA programming concepts.
HSAIL is an intermediate language for parallel computing on HSA architectures. It is generated by high-level compilers and compiled to target ISAs. HSAIL supports features like shared virtual memory, barriers, and platform atomics. It allows mainstream languages like Java to benefit from parallel computing on GPUs without code changes by bridging the gap between CPU and accelerators. Tools are available to help with HSAIL development.
Recorded video here:
http://on-demand.gputechconf.com/siggraph/2017/video/sig1757-tristan-lorach-vkFX-effective-approach-for-vulkan-api.html
Vulkan is a complex low-level API, full of structures and dedicated objects. Using it can be tedious and often leads to complicated source code. We propose here a way to define and use Vulkan components in a convenient and readable way. We then show how this infrastructure allows us to introduce and use higher-level concepts, such as Techniques and Passes, and even to instantiate resources and render targets right from within the effect, making it self-sufficient and consistent as a general description. The overall purpose of this open-source project is to improve and enhance the use of the Vulkan API while keeping its strength and flexibility. The project can run in two different ways: either as a compiler generating C++ code for you, or at runtime, to load effects and use them right away.
vkFx comes from a former project called nvFx, presented a few years ago. While nvFx was intended to be generic (OpenGL and D3D compliant), vkFx is Vulkan-specific, so the project is thin and doesn't break the important paradigms that keep Vulkan powerful.
The document provides an overview of GPU computing and CUDA programming. It discusses how GPUs enable massively parallel and affordable computing through their manycore architecture. The CUDA programming model allows developers to accelerate applications by launching parallel kernels on the GPU from their existing C/C++ code. Kernels contain many concurrent threads that execute the same code on different data. CUDA features a memory hierarchy and runtime for managing GPU memory and launching kernels. Overall, the document introduces GPU and CUDA concepts for general-purpose parallel programming on NVIDIA GPUs.
CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distribu... (AMD Developer Central)
Presentation CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distributed Platforms, by Max Grossman at the AMD Developer Summit (APU13) November 11-13, 2013.
This document provides an overview of CUDA (Compute Unified Device Architecture), NVIDIA's parallel computing platform and programming model that allows software developers to leverage the parallel compute engines in NVIDIA GPUs. The document discusses key aspects of CUDA including: GPU hardware architecture with many scalar processors and concurrent threads; the CUDA programming model with host CPU code calling parallel kernels that execute across multiple GPU threads; memory hierarchies and data transfers between host and device memory; and programming basics like compiling with nvcc, allocating and copying data between host and device memory.
Highlighted notes of:
Introduction to CUDA C: NVIDIA
Author: Blaise Barney
From: GPU Clusters, Lawrence Livermore National Laboratory
https://computing.llnl.gov/tutorials/linux_clusters/gpu/NVIDIA.Introduction_to_CUDA_C.1.pdf
Blaise Barney is a research scientist at Lawrence Livermore National Laboratory.
PL-4051, An Introduction to SPIR for OpenCL Application Developers and Compil... (AMD Developer Central)
PL-4051, An Introduction to SPIR for OpenCL Application Developers and Compiler Developers, by Yaxun Liu from the AMD Developer Summit (APU13) November 11-13, 2013.
1. CUDA provides a programming environment and APIs that allow developers to leverage GPUs for general purpose computing. The CUDA C API offers both a high-level runtime API and a lower-level driver API.
2. CUDA programs define kernels that execute many parallel threads on the GPU. Threads are organized into blocks that can cooperate through shared memory, and blocks are organized into grids.
3. The CUDA memory model includes a hierarchy from fast per-thread registers to slower shared, global, and host memories. This hierarchy allows threads within blocks to communicate efficiently through shared memory.
Siggraph 2016 - Vulkan and NVIDIA: the essentials (Tristan Lorach)
Vulkan is a modern, low-level graphics API designed for high performance. It is closer to the hardware than OpenGL and allows for explicit control and management of resources. Vulkan is multi-threading friendly and supports parallel graphics and compute workloads. It has a large number of components that must be explicitly created, managed, and synchronized by the application compared to higher-level APIs. These include devices, queues, command buffers, pipelines, descriptors, and memory. NVIDIA provides full support for Vulkan capabilities on its GPUs.
At Qt Day 2019 in Florence, Italy, I gave a presentation about "Using Qt under LGPLv3".
In the first part, I go through Section "4. Combined Works" and explain the obligations of LGPLv3 in detail.
When you combine your application with the Qt libraries, you can keep the application's source code closed and distribute the application and the Qt libraries under your own license terms, provided you satisfy the following conditions.
On a physical medium like a USB drive or DVD, you must provide the source code of the Qt libraries, the texts of the LGPLv3 and GPLv3, the copyright notices of the Qt libraries, a description of your modifications to the Qt libraries, and the installation information. If your application has a user interface, you must display the license texts of the LGPLv3 and GPLv3 and the copyright notices in that user interface.
The installation information - a.k.a. the anti-tivoisation clause - describes how the user can build a modified version of Qt, how to install this version on the device, and how to run the application with the modified Qt version on the device. If you build a B2B product, you need not provide the installation information.
In the second part, I compare the costs of Qt Commercial and Qt LGPLv3. You must check the licenses for all packages of the (embedded) Linux system, no matter whether you use Qt Commercial or Qt LGPLv3. Qt Commercial adds per-developer fees and per-unit fees (royalties). If you use Boot2Qt, Qt for Automation, Qt Safe Renderer or other commercial add-ons, Qt Commercial may well be worth the additional costs.
In the third part, I show how Yocto (bitbake) and Fossology help with the compliance checks. Fossology is a license and copyright scanner.
The overall evolution towards microservices has caused many IT leaders to radically rethink architectures and platforms. One can hardly keep up with the rapid onslaught of new distributed technologies. The same people who just yesterday asked "how can we deploy Docker containers?" are now asking "how can we operate Kubernetes-as-a-Service on-premise?", and are about to start asking "how can we operate the open source frameworks of our choice, such as Spark, TensorFlow, HDFS, and more, as a service across hybrid clouds?". This session discusses the challenges of orchestrating and operating such platforms.
About 94% of AI adopters plan to use containers within the next year. What's driving this exponential growth? Faster time to deployment and faster AI workload processing are the two major reasons. GPUs can be used in big data applications such as machine learning, data analytics, and genome sequencing, and Docker containerization makes it easier to package and distribute those applications; GPU support can be enabled when running YARN on Docker containers. In this talk, I demonstrate how Docker accelerates AI workload development and deployment on IoT edge devices in an efficient manner.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2022/08/open-standards-powering-the-future-of-embedded-vision-a-presentation-from-the-khronos-group/
Neil Trevett, President of the Khronos Group and Vice President of Developer Ecosystems at NVIDIA, presents the “Open Standards: Powering the Future of Embedded Vision” tutorial at the May 2022 Embedded Vision Summit.
Open standards play an important role in enabling interoperability for efficient deployment of vision-based systems. In this session, Trevett shares an update on the family of Khronos Group standards for programming and deploying accelerated inferencing and embedded vision, including OpenCL, Vulkan Safety Critical, OpenVX, SYCL and NNEF.
Trevett discusses the evolving roadmap for these standards and provides insights to help you understand which standards are relevant to your projects. In addition, he introduces the new Khronos Embedded Camera API initiative. Trevett outlines the technical direction of the Embedded Camera API working group to create an open standard to streamline the integration and control of sophisticated embedded camera systems, and highlights how attendees can participate in this important industry initiative.
Scaling Docker Containers using Kubernetes and Azure Container Service (Ben Hall)
This document discusses scaling Docker containers using Kubernetes and Azure Container Service. It begins with an introduction to containers and Docker, including how containers improve dependency and configuration management. It then demonstrates building and deploying containerized applications using Docker and discusses how to optimize Docker images. Finally, it introduces Kubernetes as a tool for orchestrating containers at scale and provides an example of deploying a containerized application on Kubernetes in Azure.
UplinQ - qualcomm® hexagon™ sdk optimize your multimedia solutionsSatya Harish
The Qualcomm Hexagon SDK allows developers to optimize multimedia solutions by offloading compute tasks from the application processor to the Hexagon DSP. It provides tools like FastRPC for remote procedure calls, dynamic loading to add code/data at runtime, an Eclipse plugin for debugging, and optimized Hexagon libraries. The SDK also supports audio, voice, and computer vision applications and includes tools for installation, development, and testing.
Scylla Summit 2022: ScyllaDB Rust Driver: One Driver to Rule Them AllScyllaDB
The idea for implementing a brand new Rust driver for ScyllaDB emerged from an internal hackathon in 2020. The initial goal was to provide a native implementation of a CQL driver, fully compatible with Apache Cassandra™, but also contain a variety of Scylla-specific optimizations. The development was later continued as a Warsaw University project led by ScyllaDB.
Now it's an officially supported driver with excellent performance and a wide range of features. This session shares the design decisions taken in implementing the driver and its roadmap. It also presents a forward-thinking plan to unify other Scylla-specific drivers by translating them to bindings to our Rust driver, using work on our C++ driver as an example.
To watch all of the recordings hosted during Scylla Summit 2022 visit our website here: https://www.scylladb.com/summit.
The document discusses transaction-based hardware-software co-verification using emulation. It describes how traditional cycle-based co-verification is slow due to communication overhead between the testbench and emulator. Transaction-based co-verification improves speed by only synchronizing when required and allowing parallel execution. Transactors are used to convert high-level commands from the testbench to a bit-level protocol for the emulator. This allows emulation speeds of tens of MHz, orders of magnitude faster than cycle-based. An example transactor for a virtual memory is presented.
Deploying Cloud Native Red Team Infrastructure with Kubernetes, Istio and Envoy Jeffrey Holden
This document discusses deploying cloud native red team infrastructure using Kubernetes, Istio and Envoy. It provides introductions to Larry Suto and Jeff Holden and their backgrounds. It then covers goals of being automated, portable and scriptable. Key points covered include using Kubernetes for its infrastructure as code capabilities. It discusses concepts like Docker, Kubernetes, Kops, External DNS, SSL Cert Manager and recipes for containerizing tools like Cobalt Strike, Merlin and configuring deployments.
The implementation phase involves materializing ideas from analysis and design into the final solution. The authors implemented rejuvenation on three domains using warm and cold methods based on time and prediction policies. They also simulated rejuvenation of failing nodes and implemented Petri net modeling. KVM was selected as the virtualization platform and run on CentOS. C was used as the programming language and key libraries were included. NFS was configured to enable sharing of VM images between servers to allow live migration.
The document describes NVIDIA's DRIVE PX 2, an AI supercomputer purpose-built for self-driving cars. It has 12 CPU cores, a Pascal GPU providing 8 TFLOPS of processing power and 24 DL TOPS for deep learning. The DRIVE PX 2 features various interfaces to connect to sensors, displays and development tools. It also includes software like NVIDIA DRIVEWORKS and supports AUTOSAR for automotive software development. The DRIVE PX 2 is designed to help developers create self-driving applications and migrate them from testing to production vehicles.
This document summarizes Kazuaki Ishizaki's keynote presentation at the Fourth International Symposium on Computing and Networking (CANDAR'16) on transparent GPU exploitation for Java. The presentation covered Ishizaki's research history developing compilers and optimizing code for GPUs. It described a Java just-in-time compiler that can generate optimized GPU code from parallel loops in Java programs without requiring programmers to manage low-level GPU operations like data transfers and memory allocation themselves. The compiler implements optimizations like array alignment, read-only caching, and reducing data copying to improve GPU performance. The goal is to make GPU programming easier and more portable across hardware for Java programmers.
IMAGE CAPTURE, PROCESSING AND TRANSFER VIA ETHERNET UNDER CONTROL OF MATLAB G...Christopher Diamantopoulos
This implemented DSP system utilizes TCP socket communication. Upon message reception, it decides the appropriate process to be executed based on cases which can be categorized as follows:
1) image capture
2) image transfer
3) image processing
4) sensor calibration
A user-friendly MATLAB GUI, named DIPeth, facilitates the system's control.
HKG15-300: Art's Quick Compiler: An unofficial overviewLinaro
HKG15-300: Art's Quick Compiler: An unofficial overview
---------------------------------------------------
Speaker: Matteo Franchin
Date: February 11, 2015
---------------------------------------------------
★ Session Summary ★
One of the important technical novelties introduced with the recent release of Android Lollipop is the replacement of Dalvik, the VM which was used to execute the bytecode produced from Java apps, with ART, a new Android Run-Time. One interesting aspect in this upgrade is that the use of Just-In-Time compilation was abandoned in favour of Ahead-Of-Time compilation. This delivers better performance [1], also leaving a good margin for future improvements. ART was designed to support multiple compilers. The compiler that shipped with Android Lollipop is called the “Quick Compiler”. This is simple, fast, and is derived from Dalvik’s JIT compiler. In 2014 our team at ARM worked in collaboration with Google to extend ART and its Quick Compiler to add support for 64-bit and for the A64 instruction set. These efforts culminated with the recent release of the Nexus 9 tablet, the first 64-bit Android product to hit the market. Despite Google’s intention of replacing the Quick Compiler with the so-called “Optimizing Compiler”, the job for the the Quick Compiler is not yet over. Indeed, the Quick Compiler will remain the only usable compiler in Android Lollipop. Therefore, all competing parties in the Android ecosystem have a huge interest in investigating and improving this component, which will very likely be one of the battlegrounds in the Android benchmark wars of 2015. This talk aims to give an unofficial overview of ART’s Quick compiler. It will first focus on the internal organisation of the compiler, adopting the point of view of a developer who is interested in understanding its limitations and strengths. The talk will then move to exploring the output produced by the compiler, discussing possible strategies for improving the generated code, while keeping in mind that this component may have a limited life-span, and that any long-term work would be better directed towards the Optimizing Compiler. [1] The ART runtime, B. Carlstrom, A. Ghuloum, and I. 
Rogers, Google I/O 2014,https://www.youtube.com/watch?v=EBlTzQsUoOw
--------------------------------------------------
★ Resources ★
Pathable: https://hkg15.pathable.com/meetings/250804
Video: https://www.youtube.com/watch?v=iho-e7EPHk0
Etherpad: N/A
---------------------------------------------------
★ Event Details ★
Linaro Connect Hong Kong 2015 - #HKG15
February 9-13th, 2015
Regal Airport Hotel Hong Kong Airport
---------------------------------------------------
http://www.linaro.org
http://connect.linaro.org
Flutter provides an excellent way to build Android, iOS, web and desktop apps, but what about the back end services? Full stack Dart is all about using that investment in Dart programming to build the services used by applications, whether it's in the cloud or on the Internet of Things. This presentation will look at the tradeoffs between just in time (JIT) and ahead of time (AOT) compilation, Dart on Docker, the Functions Framework for Dart, Profiling and Performance Management. Choices of back end architecture (x86_64 vs Arm) will also be examined, along with some of the challenges this can present for Continuous Delivery.
Android apps can run on Intel platforms with little to no changes needed. Most Dalvik apps will work directly, and NDK apps can be recompiled for x86 with no code changes often needed. Developers can target multiple platforms including x86 by setting APP_ABI to "all" or specifying individual ABIs. Third party game engines and libraries often support x86 as well. Intel provides tools to help Android development including HAXM for faster emulation, TBB for multi-threading, and GPA for performance analysis.
The document discusses VPU and GPGPU computing. It explains that a VPU is a visual processing unit, also known as a GPU. GPUs are massively parallel and multithreaded processors that are better than CPUs for tasks like machine learning and graphics processing. The document then discusses GPU architecture, memory, and programming models like CUDA. It provides examples of GPU usage and concludes that GPGPU is used in fields like machine learning, robotics, and scientific computing.
Similar to Optimizing NN inference performance on Arm NEON and Vulkan (20)
Embedded machine learning-based road conditions and driving behavior monitoringIJECEIAES
Car accident rates have increased in recent years, resulting in losses in human lives, properties, and other financial costs. An embedded machine learning-based system is developed to address this critical issue. The system can monitor road conditions, detect driving patterns, and identify aggressive driving behaviors. The system is based on neural networks trained on a comprehensive dataset of driving events, driving styles, and road conditions. The system effectively detects potential risks and helps mitigate the frequency and impact of accidents. The primary goal is to ensure the safety of drivers and vehicles. Collecting data involved gathering information on three key road events: normal street and normal drive, speed bumps, circular yellow speed bumps, and three aggressive driving actions: sudden start, sudden stop, and sudden entry. The gathered data is processed and analyzed using a machine learning system designed for limited power and memory devices. The developed system resulted in 91.9% accuracy, 93.6% precision, and 92% recall. The achieved inference time on an Arduino Nano 33 BLE Sense with a 32-bit CPU running at 64 MHz is 34 ms and requires 2.6 kB peak RAM and 139.9 kB program flash memory, making it suitable for resource-constrained embedded systems.
Introduction: e-waste definition; sources of e-waste; hazardous substances in e-waste; effects of e-waste on the environment and human health; the need for e-waste management; e-waste handling rules; waste minimization techniques for managing e-waste; recycling of e-waste; disposal and treatment methods of e-waste; mechanism of extraction of precious metals from leaching solution; global scenario of e-waste; e-waste in India; case studies.
Software Engineering and Project Management - Introduction, Modeling Concepts...Prakhyath Rai
Introduction, Modeling Concepts and Class Modeling: What is Object orientation? What is OO development? OO Themes; Evidence for usefulness of OO development; OO modeling history. Modeling
as Design technique: Modeling, abstraction, The Three models. Class Modeling: Object and Class Concept, Link and associations concepts, Generalization and Inheritance, A sample class model, Navigation of class models, and UML diagrams
Building the Analysis Models: Requirement Analysis, Analysis Model Approaches, Data modeling Concepts, Object Oriented Analysis, Scenario-Based Modeling, Flow-Oriented Modeling, class Based Modeling, Creating a Behavioral Model.
Null Bangalore | Pentesters Approach to AWS IAMDivyanshu
#Abstract:
- Learn more about the real-world methods for auditing AWS IAM (Identity and Access Management) as a pentester. So let us proceed with a brief discussion of IAM as well as some typical misconfigurations and their potential exploits in order to reinforce the understanding of IAM security best practices.
- Gain actionable insights into AWS IAM policies and roles, using hands on approach.
#Prerequisites:
- Basic understanding of AWS services and architecture
- Familiarity with cloud security concepts
- Experience using the AWS Management Console or AWS CLI.
- For hands on lab create account on [killercoda.com](https://killercoda.com/cloudsecurity-scenario/)
# Scenario Covered:
- Basics of IAM in AWS
- Implementing IAM Policies with Least Privilege to Manage S3 Bucket
- Objective: Create an S3 bucket with least privilege IAM policy and validate access.
- Steps:
- Create S3 bucket.
- Attach least privilege policy to IAM user.
- Validate access.
- Exploiting IAM PassRole Misconfiguration
- Allows a user to pass a specific IAM role to an AWS service (EC2), typically used for service access delegation. Then exploit the PassRole misconfiguration to gain unauthorized access to sensitive resources.
- Objective: Demonstrate how a PassRole misconfiguration can grant unauthorized access.
- Steps:
- Allow user to pass IAM role to EC2.
- Exploit misconfiguration for unauthorized access.
- Access sensitive resources.
- Exploiting IAM AssumeRole Misconfiguration with Overly Permissive Role
- An overly permissive IAM role configuration can lead to privilege escalation by creating a role with administrative privileges and allow a user to assume this role.
- Objective: Show how overly permissive IAM roles can lead to privilege escalation.
- Steps:
- Create role with administrative privileges.
- Allow user to assume the role.
- Perform administrative actions.
- Differentiation between PassRole vs AssumeRole
Try at [killercoda.com](https://killercoda.com/cloudsecurity-scenario/)
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...IJECEIAES
Medical image analysis has witnessed significant advancements with deep learning techniques. In the domain of brain tumor segmentation, the ability to precisely delineate tumor boundaries from magnetic resonance imaging (MRI) scans holds profound implications for diagnosis. This study presents an ensemble convolutional neural network (CNN) with transfer learning, integrating the state-of-the-art Deeplabv3+ architecture with the ResNet18 backbone. The model is rigorously trained and evaluated, exhibiting remarkable performance metrics, including an impressive global accuracy of 99.286%, a high class accuracy of 82.191%, a mean intersection over union (IoU) of 79.900%, a weighted IoU of 98.620%, and a Boundary F1 (BF) score of 83.303%. Notably, a detailed comparative analysis with existing methods showcases the superiority of our proposed model. These findings underscore the model’s competence in precise brain tumor localization, underscoring its potential to revolutionize medical image analysis and enhance healthcare outcomes. This research paves the way for future exploration and optimization of advanced CNN models in medical imaging, emphasizing addressing false positives and resource efficiency.
Applications of artificial Intelligence in Mechanical Engineering.pdfAtif Razi
Historically, mechanical engineering has relied heavily on human expertise and empirical methods to solve complex problems. With the introduction of computer-aided design (CAD) and finite element analysis (FEA), the field took its first steps towards digitization. These tools allowed engineers to simulate and analyze mechanical systems with greater accuracy and efficiency. However, the sheer volume of data generated by modern engineering systems and the increasing complexity of these systems have necessitated more advanced analytical tools, paving the way for AI.
AI offers the capability to process vast amounts of data, identify patterns, and make predictions with a level of speed and accuracy unattainable by traditional methods. This has profound implications for mechanical engineering, enabling more efficient design processes, predictive maintenance strategies, and optimized manufacturing operations. AI-driven tools can learn from historical data, adapt to new information, and continuously improve their performance, making them invaluable in tackling the multifaceted challenges of modern mechanical engineering.
Rainfall intensity duration frequency curve statistical analysis and modeling...bijceesjournal
Using 41 years of data from Patna, India, the study’s goal is to analyze the trends of how often it rains on a weekly, seasonal, and annual basis (1981−2020). First, utilizing the intensity-duration-frequency (IDF) curve and the relationships obtained by statistically analyzing rainfall, the historical rainfall data set for Patna, India, over a 41-year period (1981−2020) was evaluated for its quality. Changes in the hydrologic cycle as a result of increased greenhouse gas emissions are expected to induce variations in the intensity, length, and frequency of precipitation events. One strategy to lessen vulnerability is to quantify probable changes and adapt to them. Techniques such as log-normal, normal, and Gumbel (EV-I) are used. Distributions were created with durations of 1, 2, 3, 6, and 24 h and return periods of 2, 5, 10, 25, and 100 years. Mathematical correlations between rainfall and recurrence interval were also developed.
Findings: Based on findings, the Gumbel approach produced the highest intensity values, whereas the other approaches produced values that were close to each other. The data indicates that 461.9 mm of rain fell during the monsoon season’s 301st week. However, it was found that the 29th week had the greatest average rainfall, 92.6 mm. With 952.6 mm on average, the monsoon season saw the highest rainfall. Calculations revealed that the yearly rainfall averaged 1171.1 mm. Using Weibull’s method, the study was subsequently expanded to examine rainfall distribution at different recurrence intervals of 2, 5, 10, and 25 years. Rainfall and recurrence interval mathematical correlations were also developed. Further regression analysis revealed that short wave irrigation, wind direction, wind speed, pressure, relative humidity, and temperature all had a substantial influence on rainfall.
Originality and value: The results of the rainfall IDF curves can provide useful information to policymakers in making appropriate decisions in managing and minimizing floods in the study area.
3. Our upcoming Arm AI Tech Talks
Date | Title | Host
September 21st | Optimizing NN inference performance on Arm NEON and Vulkan using the ailia SDK | ax Inc.
October 5th | EON Tuner: AutoML for real-world embedded devices | Edge Impulse
October 28th | Development to Deployment of Endpoint AI Vision Algorithms Based on Arm Architecture (ARM架构端侧AI视觉算法的开发到部署) | ICE TECH
November 2nd | Getting started with running Machine Learning on Arm Ethos-U55 | Arm
Visit: developer.arm.com/techtalks
6. Company Profile
Name ax Inc.
Location Tokyo, Japan
CEO TERADA Takehiko
Business
・Development and provision of the “ailia SDK”
・AI consulting, AI model training, and other AI businesses
7. “AI everywhere”
We believe that AI will be used on every device in the near future.
We are developing fast SDKs and constantly researching the latest AI models
to help accelerate the evolution of the AI era.
ax Inc. aims to provide tools to implement the latest AI capabilities to solve
real-world problems.
8. There are many challenges
in implementing AI for various devices.
9. Those challenges include:
• Wide variety of AI models
• Multiple platforms
• Various programming languages
• Long-term API consistency
• Performance optimization for each device
and more.
10. ailia SDK
ailia SDK is an AI framework leveraging the CPU and GPU to achieve high-performance AI inference
It supports ONNX (opset 10 and 11) and enables high-performance inference using NEON and Vulkan
It offers over 140 pre-trained models, ready to be integrated into our clients’ applications
https://ailia.jp/en
11. Benefits when implementing AI on edge devices
Conventional workflow: set up the edge environment, set up the development environment, search for an optimal model, clone the model repository, install dependent libraries, convert the model, implement pre- and post-processing, and verify the results; a failure (NG) at conversion or verification forces another iteration.
ailia SDK workflow: install the ailia SDK, download ailia MODELS, execute the sample code, and verify the results.
12. ailia SDK Architecture
The ailia SDK stack, from top to bottom:
API (C++, Python, C#, JNI)
ONNX (opset = 10, 11; over 100 supported layers)
Runtime graph optimization
Accelerator (Convolution, Pooling, Resize, Add, and more)
Vulkan / Metal / NEON backends
ailia MODELS is built on top of the ailia SDK.
13. ailia MODELS
Over 140 models compatible with ailia SDK are publicly available on github
You can easily try out the latest models such as YOLOv4, MIDAS and PaddleOCR
https://github.com/axinc-ai/ailia-models
14. ailia MODELS
ailia MODELS has samples for Python, C++ and Unity.
You can run accelerated inference using Vulkan in any of these environments.
Examples: Hand Detection, Depth Estimation
16. NEON implementation in AI
Since AI workloads involve far more computation than typical applications, significant speedups are required.
Thanks to the high degree of parallelism in layer operations, there is a lot of room for optimization using SIMD instructions such as NEON.
There are many types of layers to implement, offering opportunities to apply a variety of optimization techniques.
17. NEON overview
NEON provides scalar/vector instructions and registers for Arm CPUs
It can perform parallel calculations on 128-bit units; for FP32, up to 4 elements are processed simultaneously
(Diagram: one SIMD instruction operates on two 128-bit inputs, four floats in parallel.)
18. Basic techniques
Replace code branching with a bit-select instruction
Process the sign of a number using bit manipulation
Reorder data with structure load/store instructions
// save sign bit from input value.
mask = vdupq_n_u32(1<<31);
sign = vandq_u32(cast_u32(x), mask);
v = vabsq_f32(x);
// .. any operation with abs value ..
// restore sign bit with xor
res = cast_f32(veorq_u32(cast_u32(v), sign));
// remove branch with bit select
// dst[i]=(src[i]<0.0f)?src[i]*alpha:src[i];
val = vld1q_f32(src);
sel = vcltq_f32(val, zero);
mod = vmulq_n_f32(val, alpha);
vst1q_f32(dst, vbslq_f32(sel, mod, val));
// reorder data with structure load
x[] = {
11, 12, 13, 14,
21, 22, 23, 24,
31, 32, 33, 34,
41, 42, 43, 44,
};
v = vld4q_f32(x);
// v.val[0] : { 11, 21, 31, 41 }
// v.val[1] : { 12, 22, 32, 42 }
// v.val[2] : { 13, 23, 33, 43 }
// v.val[3] : { 14, 24, 34, 44 }
19. Approximation
NEON implementations using approximate expressions for transcendental functions such as exp, log, and erf
// log(x) = log((2^n) * z) = n*log(2) + log(z)
// log(z) ≈ 2 * (w + (w^3)/3 + (w^5)/5 + ..) : w = (z-1)/(z+1)
log2 = 0.6931471805599453f;
n = pick_exponent(x);
z = pick_fractional(x);
w = (z-1.0) / (z+1.0);
ww = w*w;
r = (1.0/7.0) + (1.0/9.0)*ww;
r = (1.0/5.0) + r*ww;
r = (1.0/3.0) + r*ww;
r = (1.0 + r*ww);
r = r*w; // w + (w^3)/3 + (w^5)/5 + (w^7)/7 + (w^9)/9
return (n*log2 + r*2);
20. Threading
Taking advantage of Arm big.LITTLE technology, performance gains can be achieved by efficiently assigning jobs to different cores.
Adjust the processing unit according to the cache size.
Dividing a task into units of equal size does not fit a big.LITTLE SoC design.
Assign more processing-intensive tasks to big cores, and smaller ones to little cores.
21. Benchmark
Comparison of inference time with NEON enabled and disabled
SoC | Model | Improvement (with vs. without NEON)
Snapdragon 888 | ResNet50 | 2.15 times faster
Snapdragon 888 | YOLOv3 | 3.90 times faster
Exynos 9820 | ResNet50 | 3.08 times faster
Exynos 9820 | YOLOv3 | 2.86 times faster
22. Future work
The vector length of NEON is 128 bits, but 512-bit vector operations can be used with SVE2, added in Armv9-A
Going forward, the ailia SDK will continue to support the latest instruction sets
24. Benefits of Vulkan
Support for all major GPUs
Support for all major OSs: Windows, Android, Linux
Easy installation for GPU inference: being widely used for gaming, it only requires standard drivers to run
Little additional disk space usage: ailia_vulkan.dll is only 2.8 MB
25. AI Inference Acceleration with Vulkan
Fast AI inference achieved using Runtime Graph Optimization and Layer Fusion
Implementation of the Optimized Logic for GEMM using Vulkan’s Compute Shader
Implementation of layers such as Convolution or Pooling using Vulkan’s Compute Shader
Implementation of the Winograd Algorithm with shaders to accelerate the heavy Convolution layer
26. Roles of the CPU and GPU
CPU (through the computing API and GPU driver):
Allocate GPU memory
Send data to GPU memory
Create the compute kernel on the GPU
Dispatch the compute kernel on the GPU
Fetch the compute result from GPU memory
GPU:
Compute using the compute kernel
27. Code required for Vulkan
Create VkInstance
Select VkPhysicalDevice
Create VkDevice
Get VkQueue
Allocate VkDeviceMemory
Bind VkDeviceMemory to VkBuffer
Send data to device from host
Configure VkShaderModule
Create VkDescriptorSetLayout
Create VkPipelineLayout
Create VkPipeline
Create VkDescriptorPool
Create VkDescriptorSet
Create VkCommandPool
Create VkCommandBuffer
Register VkPipeline with VkCommandBuffer
Register VkDescriptorSet with VkCommandBuffer
Call vkCmdDispatch
Execute and wait for the VkCommandBuffer to complete
Transfer data from device to host
28. Example of GPU Computing
Add the corresponding elements of inputs A and B and write the result to the corresponding element of the output; all three buffers have the same shape.
Input A [ z=4, y=360, x=640 ]
Input B [ z=4, y=360, x=640 ]
Output [ z=4, y=360, x=640 ]
30. Example of GPU Computing : Host code (CPU code)
// Instance initialization, device selection, buffer allocation, shader module construction, etc. are omitted.
vkBeginCommandBuffer(cmdBuf, &beginInfo); // Start recording the command buffer
// Bind the pipeline to the command buffer (the shader module is baked into the pipeline)
vkCmdBindPipeline(cmdBuf, VK_PIPELINE_BIND_POINT_COMPUTE, pipeline);
// Bind the descriptor set (the I/O buffer bindings are recorded in descSet)
vkCmdBindDescriptorSets(cmdBuf, VK_PIPELINE_BIND_POINT_COMPUTE,
                        pipelineLayout, 0, 1, &descSet, 0, NULL);
// Specify how many workgroups to launch,
// using the most recently bound pipeline and descriptor set
// Example: z=4, y=360, x=640 processed one-dimensionally with localsize = 64
vkCmdDispatch(cmdBuf, (4*360*640)/64, 1, 1);
vkEndCommandBuffer(cmdBuf); // End of command buffer recording
// Submit the command buffer to the device and run the shader (cmdBuf is referenced by submitInfo)
vkQueueSubmit(queue, 1, &submitInfo, VK_NULL_HANDLE);
// Waiting for completion and reading back the result are omitted
31. Workgroups and Threads (Invocations)
Host (CPU) code specifies how many workgroups to launch (groupCount)
Device (GPU) code (the shader) describes what each thread does
localsize specifies the number of threads (invocations) per workgroup
(Diagram: vkCmdDispatch on the CPU launches groupCount workgroups on the GPU; each workgroup contains localsize threads.)
32. Arm GPU HW configuration example
https://developer.arm.com/ip-products/graphics-and-multimedia/mali-gpus/mali-g76-gpu
Configurable from 4 to 20 shader cores, delivering the largest capability of any Mali GPU to date
3 engines per shader core
8 execution lanes per engine
33. Correspondence with API
vkCmdDispatch(.., groupCountX, groupCountY, groupCountZ);
Kernel boot process with host code
Launch (groupCountX * groupCountY * groupCountZ) workgroups
One workgroup is assigned to one execution engine
If all the execution engines can be kept busy, the GPU is fully utilized
layout(local_size_x=64) in;
The part of the device code that specifies the thread count
Specifies the number of local threads (invocations) in a workgroup
If enough threads are issued to fill the execution lanes of each engine, the execution engines are fully utilized.
34. Relation between localsize and dispatch
If the localsize is 64 and the image size is 640 × 360, you need to launch (360 * 640) / 64 = 3600 workgroups to cover the entire image.
35. How to choose localsize
The limit on localsize is defined by maxComputeWorkGroupSize
maxComputeWorkGroupSize = {384, 384, 384} (Mali-G76): the maximum size of a local compute workgroup, per dimension
maxComputeWorkGroupInvocations = 384 (Mali-G76): the maximum total number of compute shader invocations in a single local workgroup
The Arm Mali GPU Best Practices Developer Guide says:
“Use 64 as a baseline workgroup size. Do not use more than 64 threads per workgroup.”
localsize tuning is required for complex and heavy kernels
36. Execution time by localsize
An error occurs if maxComputeWorkGroupSize is exceeded
Performance saturates at localsize >= 32
localsize = 1 gives the worst performance, because only a single execution lane is used per workgroup
37. GEMM
General matrix multiply: multiplication of 2D dense matrices
Often used in scientific computing (computer simulation)
Patterson & Hennessy, “Computer Organization and Design”: a 1426-page textbook that tells the story of speeding up GEMM
Matrix A * Matrix B = Matrix C
39. Basic implementation of GEMM
// I / O definition omitted (when trans_a == false && trans_b == false)
#define BLOCK_SIZE 8
layout(local_size_x=BLOCK_SIZE, local_size_y=BLOCK_SIZE) in;
// shared memory can be commonly referenced from workgroup
shared float sa[ BLOCK_SIZE * BLOCK_SIZE ];
shared float sb[ BLOCK_SIZE * BLOCK_SIZE ];
void main()
{
uint lx = gl_LocalInvocationID.x; uint ly = gl_LocalInvocationID.y;
uint dx = gl_WorkGroupID.x * BLOCK_SIZE; uint dy = gl_WorkGroupID.y * BLOCK_SIZE;
float sum = 0.0;
for (uint k=0; k<TOTAL_K; k+=BLOCK_SIZE) {
sa[ ly * BLOCK_SIZE + lx ] = src_a.data[ (dy + ly) * src_a_width + ( k+lx) ];
sb[ ly * BLOCK_SIZE + lx ] = src_b.data[ ( k + ly) * src_b_width + (dx+lx) ];
barrier();
for (uint i=0; i<BLOCK_SIZE; ++i) {
sum += sa[ ly * BLOCK_SIZE + i ] * sb[ i * BLOCK_SIZE + lx ];
}
barrier();
}
// Exception handling of fractional blocks is omitted
dst.data[ (dy+ly) * dst_width + (dx+lx) ] = sum;
}
One thread is responsible for one output element
40. Basic implementation of GEMM
// I / O definition omitted (when trans_a == false && trans_b == false)
#define BLOCK_SIZE 8
layout(local_size_x=BLOCK_SIZE, local_size_y=BLOCK_SIZE) in;
// shared memory can be commonly referenced from workgroup
shared float sa[ BLOCK_SIZE * BLOCK_SIZE ];
shared float sb[ BLOCK_SIZE * BLOCK_SIZE ];
void main()
{
uint lx = gl_LocalInvocationID.x; uint ly = gl_LocalInvocationID.y;
uint dx = gl_WorkGroupID.x * BLOCK_SIZE; uint dy = gl_WorkGroupID.y * BLOCK_SIZE;
float sum = 0.0;
for (uint k=0; k<TOTAL_K; k+=BLOCK_SIZE) {
sa[ ly * BLOCK_SIZE + lx ] = src_a.data[ (dy + ly) * src_a_width + ( k+lx) ];
sb[ ly * BLOCK_SIZE + lx ] = src_b.data[ ( k + ly) * src_b_width + (dx+lx) ];
barrier();
for (uint i=0; i<BLOCK_SIZE; ++i) {
sum += sa[ ly * BLOCK_SIZE + i ] * sb[ i * BLOCK_SIZE + lx ];
}
barrier();
}
// Exception handling of fractional blocks is omitted
dst.data[ (dy+ly) * dst_width + (dx+lx) ] = sum;
}
localsize is 8 x 8 = 64
Create threads with 2D blocks
41. Basic implementation of GEMM
// I / O definition omitted (when trans_a == false && trans_b == false)
#define BLOCK_SIZE 8
layout(local_size_x=BLOCK_SIZE, local_size_y=BLOCK_SIZE) in;
// shared memory can be commonly referenced from workgroup
shared float sa[ BLOCK_SIZE * BLOCK_SIZE ];
shared float sb[ BLOCK_SIZE * BLOCK_SIZE ];
void main()
{
uint lx = gl_LocalInvocationID.x; uint ly = gl_LocalInvocationID.y;
uint dx = gl_WorkGroupID.x * BLOCK_SIZE; uint dy = gl_WorkGroupID.y * BLOCK_SIZE;
float sum = 0.0;
for (uint k=0; k<TOTAL_K; k+=BLOCK_SIZE) {
sa[ ly * BLOCK_SIZE + lx ] = src_a.data[ (dy + ly) * src_a_width + ( k+lx) ];
sb[ ly * BLOCK_SIZE + lx ] = src_b.data[ ( k + ly) * src_b_width + (dx+lx) ];
barrier();
for (uint i=0; i<BLOCK_SIZE; ++i) {
sum += sa[ ly * BLOCK_SIZE + i ] * sb[ i * BLOCK_SIZE + lx ];
}
barrier();
}
// Exception handling of fractional blocks is omitted
dst.data[ (dy+ly) * dst_width + (dx+lx) ] = sum;
}
Both A and B are read into shared memory
42. Basic implementation of GEMM
// I / O definition omitted (when trans_a == false && trans_b == false)
#define BLOCK_SIZE 8
layout(local_size_x=BLOCK_SIZE, local_size_y=BLOCK_SIZE) in;
// shared memory can be commonly referenced from workgroup
shared float sa[ BLOCK_SIZE * BLOCK_SIZE ];
shared float sb[ BLOCK_SIZE * BLOCK_SIZE ];
void main()
{
uint lx = gl_LocalInvocationID.x; uint ly = gl_LocalInvocationID.y;
uint dx = gl_WorkGroupID.x * BLOCK_SIZE; uint dy = gl_WorkGroupID.y * BLOCK_SIZE;
float sum = 0.0;
for (uint k=0; k<TOTAL_K; k+=BLOCK_SIZE) {
sa[ ly * BLOCK_SIZE + lx ] = src_a.data[ (dy + ly) * src_a_width + ( k+lx) ];
sb[ ly * BLOCK_SIZE + lx ] = src_b.data[ ( k + ly) * src_b_width + (dx+lx) ];
barrier();
for (uint i=0; i<BLOCK_SIZE; ++i) {
sum += sa[ ly * BLOCK_SIZE + i ] * sb[ i * BLOCK_SIZE + lx ];
}
barrier();
}
// Exception handling of fractional blocks is omitted
dst.data[ (dy+ly) * dst_width + (dx+lx) ] = sum;
}
Between the barriers, each thread computes its output element by
reading the shared-memory tiles loaded by the other threads in the
workgroup.
43. Features of the old GEMM implementation
1 thread is responsible for 1 output element
localsize is 8 * 8 = 64
Both A and B are stored in shared memory
barrier() is placed so that each thread can refer to shared memory loaded by the other threads in the workgroup
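The behavior of the old implementation can be sketched in plain Python with NumPy. This is not the shader itself: each (dy, dx) pair stands in for one workgroup, and the `sa`/`sb` slices stand in for the shared-memory tiles that the 64 threads load cooperatively.

```python
import numpy as np

BLOCK_SIZE = 8  # matches the shader's local_size_x/y


def gemm_tiled(A, B):
    """Reference sketch of the shared-memory tiled GEMM.

    Each (dy, dx) iteration corresponds to one workgroup; sa and sb
    play the role of the shared-memory tiles, and the accumulated
    tile product corresponds to the shader's inner i-loop.
    """
    M, K = A.shape
    K2, N = B.shape
    # Fractional blocks are omitted here, as in the slide.
    assert K == K2 and M % BLOCK_SIZE == 0 and N % BLOCK_SIZE == 0 and K % BLOCK_SIZE == 0
    C = np.zeros((M, N), dtype=A.dtype)
    for dy in range(0, M, BLOCK_SIZE):
        for dx in range(0, N, BLOCK_SIZE):
            acc = np.zeros((BLOCK_SIZE, BLOCK_SIZE), dtype=A.dtype)
            for k in range(0, K, BLOCK_SIZE):
                sa = A[dy:dy + BLOCK_SIZE, k:k + BLOCK_SIZE]   # tile of A in "shared memory"
                sb = B[k:k + BLOCK_SIZE, dx:dx + BLOCK_SIZE]   # tile of B in "shared memory"
                acc += sa @ sb                                 # inner i-loop of the shader
            C[dy:dy + BLOCK_SIZE, dx:dx + BLOCK_SIZE] = acc
    return C
```

Comparing the result against `A @ B` is a quick way to convince yourself the tiling is equivalent to the plain matrix product.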
44. Related research about GEMM
V. Volkov and J. Demmel, “Benchmarking GPUs to Tune Dense Linear Algebra”
https://www.cs.colostate.edu/~cs675/volkov08-sc08talk.pdf
• Increasing the localsize makes register spills more likely.
• Shared memory is slower than registers.
• Placing only B in shared memory and keeping A in registers made the kernel faster.
46. Optimized implementation of GEMM
// I / O definition omitted
#define BLOCK_SIZE 16
layout(local_size_x=BLOCK_SIZE) in;
// shared memory can be commonly referenced from workgroup
shared vec4 sb[ BLOCK_SIZE ];
void main()
{
// Coordinate calculation part omitted
float sum[BLOCK_SIZE];
for (int i=0; i<BLOCK_SIZE; ++i) { sum[i] = 0.0f; }
for (uint k=0; k<TOTAL_K; k+=4) {
sb[ lx ] = load_b(dx+lx, k);
barrier();
vec4 wa = load_a(dy+lx, k);
for (uint i=0; i<BLOCK_SIZE; ++i) {
sum[i] += dot(wa, sb[i]);
}
barrier();
}
// Exception handling of fractional blocks is omitted
for (int i=0; i<BLOCK_SIZE; ++i) {
dst.data[ (dy+lx) * dst_width + (dx+i) ] = sum[i];
}
}
47. Optimized implementation of GEMM
// I / O definition omitted
#define BLOCK_SIZE 16
layout(local_size_x=BLOCK_SIZE) in;
// shared memory can be commonly referenced from workgroup
shared vec4 sb[ BLOCK_SIZE ];
void main()
{
// Coordinate calculation part omitted
float sum[BLOCK_SIZE];
for (int i=0; i<BLOCK_SIZE; ++i) { sum[i] = 0.0f; }
for (uint k=0; k<TOTAL_K; k+=4) {
sb[ lx ] = load_b(dx+lx, k);
barrier();
vec4 wa = load_a(dy+lx, k);
for (uint i=0; i<BLOCK_SIZE; ++i) {
sum[i] += dot(wa, sb[i]);
}
barrier();
}
// Exception handling of fractional blocks is omitted
for (int i=0; i<BLOCK_SIZE; ++i) {
dst.data[ (dy+lx) * dst_width + (dx+i) ] = sum[i];
}
}
1 thread is responsible for BLOCK_SIZE output elements
48. Optimized implementation of GEMM
// I / O definition omitted
#define BLOCK_SIZE 16
layout(local_size_x=BLOCK_SIZE) in;
// shared memory can be commonly referenced from workgroup
shared vec4 sb[ BLOCK_SIZE ];
void main()
{
// Coordinate calculation part omitted
float sum[BLOCK_SIZE];
for (int i=0; i<BLOCK_SIZE; ++i) { sum[i] = 0.0f; }
for (uint k=0; k<TOTAL_K; k+=4) {
sb[ lx ] = load_b(dx+lx, k);
barrier();
vec4 wa = load_a(dy+lx, k);
for (uint i=0; i<BLOCK_SIZE; ++i) {
sum[i] += dot(wa, sb[i]);
}
barrier();
}
// Exception handling of fractional blocks is omitted
for (int i=0; i<BLOCK_SIZE; ++i) {
dst.data[ (dy+lx) * dst_width + (dx+i) ] = sum[i];
}
}
localsize is one-dimensional, with BLOCK_SIZE = 16 threads
49. Optimized implementation of GEMM
// I / O definition omitted
#define BLOCK_SIZE 16
layout(local_size_x=BLOCK_SIZE) in;
// shared memory can be commonly referenced from workgroup
shared vec4 sb[ BLOCK_SIZE ];
void main()
{
// Coordinate calculation part omitted
float sum[BLOCK_SIZE];
for (int i=0; i<BLOCK_SIZE; ++i) { sum[i] = 0.0f; }
for (uint k=0; k<TOTAL_K; k+=4) {
sb[ lx ] = load_b(dx+lx, k);
barrier();
vec4 wa = load_a(dy+lx, k);
for (uint i=0; i<BLOCK_SIZE; ++i) {
sum[i] += dot(wa, sb[i]);
}
barrier();
}
// Exception handling of fractional blocks is omitted
for (int i=0; i<BLOCK_SIZE; ++i) {
dst.data[ (dy+lx) * dst_width + (dx+i) ] = sum[i];
}
}
Only B is stored in shared memory
50. Optimized implementation of GEMM
// I / O definition omitted
#define BLOCK_SIZE 16
layout(local_size_x=BLOCK_SIZE) in;
// shared memory can be commonly referenced from workgroup
shared vec4 sb[ BLOCK_SIZE ];
void main()
{
// Coordinate calculation part omitted
float sum[BLOCK_SIZE];
for (int i=0; i<BLOCK_SIZE; ++i) { sum[i] = 0.0f; }
for (uint k=0; k<TOTAL_K; k+=4) {
sb[ lx ] = load_b(dx+lx, k);
barrier();
vec4 wa = load_a(dy+lx, k);
for (uint i=0; i<BLOCK_SIZE; ++i) {
sum[i] += dot(wa, sb[i]);
}
barrier();
}
// Exception handling of fractional blocks is omitted
for (int i=0; i<BLOCK_SIZE; ++i) {
dst.data[ (dy+lx) * dst_width + (dx+i) ] = sum[i];
}
}
A is held in an ordinary local variable, which is expected to be
allocated to registers
51. Optimized implementation of GEMM
// I / O definition omitted
#define BLOCK_SIZE 16
layout(local_size_x=BLOCK_SIZE) in;
// shared memory can be commonly referenced from workgroup
shared vec4 sb[ BLOCK_SIZE ];
void main()
{
// Coordinate calculation part omitted
float sum[BLOCK_SIZE];
for (int i=0; i<BLOCK_SIZE; ++i) { sum[i] = 0.0f; }
for (uint k=0; k<TOTAL_K; k+=4) {
sb[ lx ] = load_b(dx+lx, k);
barrier();
vec4 wa = load_a(dy+lx, k);
for (uint i=0; i<BLOCK_SIZE; ++i) {
sum[i] += dot(wa, sb[i]);
}
barrier();
}
// Exception handling of fractional blocks is omitted
for (int i=0; i<BLOCK_SIZE; ++i) {
dst.data[ (dy+lx) * dst_width + (dx+i) ] = sum[i];
}
}
The vec4 type packs 4 floats, so each thread is expected to drive
4 arithmetic units
52. Features of the new GEMM implementation
1 thread is responsible for BLOCK_SIZE output elements
localsize is BLOCK_SIZE = 16
Only B is stored in shared memory
A is held in an ordinary variable, expected to be register-allocated
vec4 lets each thread drive 4 arithmetic units
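The new implementation can likewise be sketched in Python. Again this is only a model of the kernel, not the kernel itself: each `lx` stands in for one thread, `wa` for the vec4 of A held in a register, and the B slices for the vec4 entries of the shared-memory tile (the shader's `load_a`/`load_b` helpers are not reproduced here).

```python
import numpy as np

BLOCK_SIZE = 16  # one workgroup = 16 threads, each producing 16 outputs


def gemm_register_blocked(A, B):
    """Sketch of the optimized kernel: per k-step, each "thread" loads
    a vec4 of A into a register and dots it against the BLOCK_SIZE
    vec4 columns of B held in shared memory."""
    M, K = A.shape
    _, N = B.shape
    # Fractional blocks are omitted here, as in the slide.
    assert M % BLOCK_SIZE == 0 and N % BLOCK_SIZE == 0 and K % 4 == 0
    C = np.zeros((M, N), dtype=A.dtype)
    for dy in range(0, M, BLOCK_SIZE):
        for dx in range(0, N, BLOCK_SIZE):
            for lx in range(BLOCK_SIZE):          # one "thread" per output row
                sums = np.zeros(BLOCK_SIZE, dtype=A.dtype)
                for k in range(0, K, 4):
                    wa = A[dy + lx, k:k + 4]      # vec4 of A in a register
                    for i in range(BLOCK_SIZE):
                        sb_i = B[k:k + 4, dx + i]  # vec4 of B from "shared memory"
                        sums[i] += wa @ sb_i       # dot(wa, sb[i])
                C[dy + lx, dx:dx + BLOCK_SIZE] = sums
    return C
```

The key structural difference from the old version is visible here: the `sums` array lives per thread (registers), and only B is shared across threads.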
53. Performance comparison on BERT
BERT is a Transformer-based natural language processing model
Its computation is dominated by GEMM (MatMul)
Inference is 2.5 to 3.5 times faster with the optimized implementation
For example, in our benchmark it is 3.53 times faster on Arm Mali-G76
54. The Winograd Algorithm
The arithmetic complexity of a convolution layer can be reduced by applying transforms to the inputs
and the output
Although the transforms themselves have some cost, the matrix multiplication, which takes up most of
the computation time, is reduced
https://arxiv.org/abs/1509.09308
For a 3x3 convolution, the number of multiplications per output tile can be reduced from 36 to 16
[Diagram: the filter weights and the image are each transformed, multiplied together as matrices, and the result is inverse-transformed.]
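The saving is easiest to see in the 1-D case. The sketch below is the standard Winograd F(2,3) formula from the paper cited above: it produces 2 outputs of a 3-tap filter with 4 multiplications instead of 6. The 2-D F(2x2,3x3) case used for 3x3 convolutions nests this construction, which is where 16 multiplications versus 36 comes from.

```python
def winograd_f23(d, g):
    """1-D Winograd F(2,3): two outputs of a 3-tap filter using
    4 multiplications instead of the direct method's 6."""
    d0, d1, d2, d3 = d  # 4 input samples
    g0, g1, g2 = g      # 3 filter taps
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2
    m4 = (d1 - d3) * g2
    # Equals (d0*g0 + d1*g1 + d2*g2, d1*g0 + d2*g1 + d3*g2)
    return (m1 + m2 + m3, m2 - m3 - m4)
```

In practice the filter-side transform (the terms involving only g) is precomputed once per layer, so only the input and output transforms are paid at inference time.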
55. Matrix Multiplication Optimization
The GEMM block size is M = N = K = 4
(M = 8 on some architectures)
Each invocation computes the matrix product for one block
No workgroup synchronization is needed
16-byte memory alignment allows reading a vec4 in a single instruction
Memory access is optimized by keeping data in vector format as much as possible in later processing
56. Command Buffer Management
Because the overhead of vkQueueSubmit is large, a sub thread is used to reduce the load
1. Start a new thread to process command buffers
2. When a request is made through ailia’s API, the command buffer is added to the sub thread’s queue
3. When the sub thread is idle, the queued command buffers are sent all at once to vkQueueSubmit
[Diagram: the main thread parses the graph and records command buffers (Convolution + Activation, Pooling) into ready_command_buffer_vector; the sub thread submits them all at once with vkQueueSubmit and waits via vkQueueWaitIdle.]
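The batching scheme in steps 1-3 can be sketched as follows. This is a hypothetical illustration, not ailia's actual code: `CommandBufferSubmitter` and `submit_fn` are made-up names, and `submit_fn` stands in for a single vkQueueSubmit call covering the whole batch.

```python
import queue
import threading


class CommandBufferSubmitter:
    """Sketch of the batching scheme: the main thread enqueues command
    buffers; a sub thread drains the whole queue when it is idle and
    submits the batch in one call (standing in for one vkQueueSubmit)."""

    def __init__(self, submit_fn):
        self._submit = submit_fn                 # e.g. wraps vkQueueSubmit
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._worker, daemon=True)
        self._thread.start()

    def enqueue(self, cmd_buffer):
        """Called from the main thread for each recorded command buffer."""
        self._queue.put(cmd_buffer)

    def close(self):
        """Send a stop sentinel and wait for the sub thread to finish."""
        self._queue.put(None)
        self._thread.join()

    def _worker(self):
        while True:
            item = self._queue.get()             # block until work arrives
            if item is None:                     # sentinel from close()
                return
            batch = [item]
            while True:                          # drain everything queued so far
                try:
                    nxt = self._queue.get_nowait()
                except queue.Empty:
                    break
                if nxt is None:                  # sentinel found mid-drain
                    self._submit(batch)
                    return
                batch.append(nxt)
            self._submit(batch)                  # one submit for the whole batch
```

The point of the design is that the number of submit calls scales with how often the sub thread goes idle, not with the number of command buffers.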
58. Summary
High-performance AI inference can be achieved using NEON and Vulkan, regardless of device or OS,
with only standard drivers
Easy integration into any application through ailia SDK
Fast inference using Vulkan has already been implemented for more than 140 AI models
59. “ailia” resolves all problems
ailia SDK
ailia MODELS
ailia TRAINER
ailia VISION
ailia APPS
ailia MEDIA
60. Reference
ax Inc. provides total solutions related to AI, from consulting to model, SDK, AI applications / systems
development and support. Should you have any inquiry, feel free to get in touch with us.
ax Inc. home page https://axinc.jp/en/ (Inquiry)
ailia MEDIA https://medium.com/axinc-ai
ailia SDK https://ailia.jp/en/ (Free trial version available)
ailia MODELS https://github.com/axinc-ai/ailia-models
ailia AI Showcase (Video) https://www.youtube.com/watch?v=lRnWX1rDRQU
ailia AI Showcase (Android) https://play.google.com/store/apps/details?id=jp.axinc.ailia_ai_showcase