This document discusses parallelizing computer vision algorithms using GPGPU computing. It begins with an introduction to multicore computing and GPUs. It explains that as CPU clock speeds can no longer increase due to power constraints, the industry has shifted to multicore CPUs and GPUs to continue improving performance. Computer vision algorithms are well-suited to parallelization on GPUs due to their massive data processing needs. The document reviews GPU architectures from Nvidia, Qualcomm, AMD, and ARM that can be used to accelerate computer vision. It also discusses parallel programming frameworks for GPUs like CUDA, OpenCL, and OpenACC.
A graphics processing unit, or GPU (occasionally also called a visual processing unit, or VPU), is a specialized microprocessor that offloads and accelerates graphics rendering from the central processor. Modern GPUs are very efficient at manipulating computer graphics, and their highly parallel structure makes them more effective than general-purpose CPUs for a range of complex algorithms. In a CPU, only a fraction of the chip performs computations, whereas a GPU devotes more of its transistors to data processing.
GPGPU is a programming methodology based on modifying algorithms to run on existing GPU hardware for increased performance. Unfortunately, GPGPU programming is significantly more complex than traditional programming for several reasons.
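As a rough sketch of the pattern GPGPU exploits (the same operation applied independently to every data element), consider a toy per-pixel brighten step. This is purely illustrative: all names are invented, and a Python thread pool only mimics the kernel-to-thread mapping, it does not deliver GPU-style speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def brighten(pixel, delta=40):
    # The "kernel": one independent operation per data element.
    return min(pixel + delta, 255)

def brighten_serial(image):
    return [brighten(p) for p in image]

def brighten_parallel(image, workers=4):
    # A thread pool stands in for the GPU's grid of lightweight threads;
    # on a real GPU each pixel would be handled by its own thread.
    with ThreadPoolExecutor(workers) as pool:
        return list(pool.map(brighten, image))

image = [0, 100, 200, 250] * 1000   # a fake flat grayscale image
assert brighten_serial(image) == brighten_parallel(image)
```

Because every pixel is independent, the work partitions trivially, which is exactly why vision workloads suit the GPU's many-core design.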
MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabilities (AMD Developer Central)
Presentation MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabilities, by Srikanth Gollapudi at the AMD Developer Summit (APU13) November 11-13, 2013.
Slide at OpenStack Summit 2018 Vancouver
Session Info and Video: https://www.openstack.org/videos/vancouver-2018/can-we-boost-more-hpc-performance-integrate-ibm-power-servers-with-gpus-to-openstack-environment
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2021/10/deploying-pytorch-models-for-real-time-inference-on-the-edge-a-presentation-from-nomitri/
Moritz August, CDO at Nomitri GmbH, presents the “Deploying PyTorch Models for Real-time Inference On the Edge” tutorial at the May 2021 Embedded Vision Summit.
In this presentation, August provides an overview of workflows for deploying compressed deep learning models, starting with PyTorch and creating native C++ application code running in real-time on embedded hardware platforms. He illustrates these workflows on smartphones with real-world examples targeting ARM-based CPU, GPUs, and NPUs as well as embedded chips and modules like the NXP i.MX8+ and NVIDIA Jetson Nano.
August examines TorchScript, architecture-side optimizations, quantization and common pitfalls. Additionally, he shows how the PyTorch deployment workflow can be extended to conversion to ONNX and quantization of ONNX models using an ONNX Runtime. On the application side, he demonstrates how deployed models can be integrated efficiently into a C++ library that runs natively on mobile and embedded devices and highlights known limitations.
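The quantization step August covers boils down to mapping floats onto int8 with a scale and zero-point. A minimal sketch of that affine scheme (formulas only; real toolchains such as PyTorch and ONNX Runtime also calibrate the scale per tensor or per channel):

```python
# Affine (scale / zero-point) int8 quantization, as used when deploying
# compressed models. Values and ranges below are illustrative.

def quantize(x, scale, zero_point):
    """float -> int8: q = clamp(round(x / scale) + zero_point, -128, 127)"""
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))

def dequantize(q, scale, zero_point):
    """int8 -> float approximation: x ~ (q - zero_point) * scale"""
    return (q - zero_point) * scale

weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
scale, zero_point = 2.0 / 255, 0          # symmetric range [-1, 1]
q = [quantize(w, scale, zero_point) for w in weights]
restored = [dequantize(v, scale, zero_point) for v in q]
# Round-tripping loses at most about scale/2 per weight.
assert all(abs(w - r) <= scale / 2 + 1e-9 for w, r in zip(weights, restored))
```

The bounded round-trip error is why quantized models usually lose little accuracy while shrinking 4x and running on integer units.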
Slides at OpenStack Summit 2017 Sydney
Session Info and Video: https://www.openstack.org/videos/sydney-2017/100gbps-openstack-for-providing-high-performance-nfv
Review state-of-the-art techniques that use neural networks to synthesize motion, such as mode-adaptive neural network and phase-functioned neural networks. See how next-generation CPUs with reinforcement learning can offer better performance.
For the full video of this presentation, please visit:
https://www.edge-ai-vision.com/2020/03/opencv-past-present-and-future-a-presentation-from-opencv-org/
For more information about edge AI and vision, please visit:
http://www.edge-ai-vision.com
Gary Bradski, the President and CEO of OpenCV.org, delivers the presentation “OpenCV: Past, Present and Future” at the Edge AI and Vision Alliance’s March 2020 Vision Industry and Technology Forum. Bradski shares the latest developments in the OpenCV open source library for computer vision and deep learning applications, as well as where OpenCV is heading.
For the full video of this presentation, please visit:
https://www.embedded-vision.com/platinum-members/embedded-vision-alliance/embedded-vision-training/videos/pages/dec-2019-alliance-vitf-khronos
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Neil Trevett, President of the Khronos Group and Vice President of Developer Ecosystems at NVIDIA, delivers the presentation "Current and Planned Standards for Computer Vision and Machine Learning" at the Embedded Vision Alliance's December 2019 Vision Industry and Technology Forum. Trevett shares updates on recent, current and planned Khronos standardization activities aimed at streamlining the deployment of embedded vision and AI.
Despite the growing number of deep learning practitioners and researchers, many of them do not use GPUs, which can lead to long training/evaluation cycles and impractical research.
In his talk, Lior shares how to get started with GPUs, along with some of the best practices that helped him during research and work. The talk is for everyone who works with machine learning (deep learning experience is NOT mandatory!). It covers the very basics of how GPUs work, CUDA drivers, IDE configuration, training, inference, and multi-GPU training.
Applying Deep Learning Vision Technology to Low-Cost/Low-Power Embedded Systems (Jenny Midwinter)
Slides from Ottawa Machine Learning Meetup from January 16, 2016.
Pierre Paulin, Director of R&D at Synopsys (Embedded Vision Subsystems), will be making a presentation on:
“Applying Deep Learning Vision Technology to Low-Cost, Low-Power Embedded Systems: An Industrial Perspective”
Hire a Machine to Code - Michael Arthur Bucko & Aurélien Nicolas (WithTheBest)
Bucko and Nicolas share their vision and products, and explain what Deckard is. Drawing on insights from their software development team, they argue that code can solve many of the problems we face, and they place their hopes in machine-assisted source coding as a way to reduce human error.
Michael Arthur Bucko & Aurélien Nicolas
For the full video of this presentation, please visit:
https://www.embedded-vision.com/platinum-members/embedded-vision-alliance/embedded-vision-training/videos/pages/may-2019-embedded-vision-summit-mallick
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Satya Mallick, Interim CEO of OpenCV.org, presents the "OpenCV: Current Status and Future Plans" tutorial at the May 2019 Embedded Vision Summit.
With over two million downloads per week, OpenCV is the most popular open source computer vision library in the world. It implements over 2500 optimized algorithms, works on all major operating systems, is available in multiple languages and is free for commercial use.
This talk primarily provides a technical update on OpenCV: What’s new in OpenCV 4.0? What is the Graph API? Why are we so excited about the Deep Neural Network (DNN) module in OpenCV? (Short answer: It is one of the fastest inference engines on the CPU.)
Mallick also shares plans for the future of OpenCV, including new algorithms that the organization plans to add through the Google Summer of Code this year. And he briefly shares information on the new Open Source Vision Foundation (OSVF), on OpenCV’s sister organizations, CARLA and Open3D, and on some of the initiatives planned by these organizations.
For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/embedded-vision-alliance/embedded-vision-training/videos/pages/may-2014-embedded-vision-summit-khronos
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Neil Trevett, President of Khronos and Vice President at NVIDIA, presents the "OpenVX Hardware Acceleration API for Embedded Vision Applications and Libraries" tutorial at the May 2014 Embedded Vision Summit.
This presentation introduces OpenVX, a new application programming interface (API) from the Khronos Group. OpenVX enables performance and power optimized vision algorithms for use cases such as face, body and gesture tracking, smart video surveillance, automatic driver assistance systems, object and scene reconstruction, augmented reality, visual inspection, robotics and more.
OpenVX enables significant implementation innovation while maintaining a consistent API for developers. OpenVX can be used directly by applications or to accelerate higher-level middleware with platform portability. OpenVX complements the popular OpenCV open source vision library that is often used for application prototyping.
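The key OpenVX idea is declaring a pipeline as a graph of kernels first, then letting the runtime execute (and potentially fuse or offload) it. A toy sketch of that graph-then-execute pattern in plain Python; all class, node, and kernel names here are invented for illustration, not OpenVX API names:

```python
# Toy "declare a graph, then process it" pattern. A real OpenVX runtime
# could fuse these nodes or map them onto a GPU/DSP; here we just run
# them in dependency order on a flat grayscale image.

class Graph:
    def __init__(self):
        self.nodes = []                 # (function, name) pairs in order

    def add_node(self, fn, name):
        self.nodes.append((fn, name))
        return self                     # allow chaining

    def process(self, image):
        for fn, _ in self.nodes:
            image = fn(image)
        return image

def gaussian_blur(img):                 # stand-in blur: 3-tap average
    return [(img[max(i - 1, 0)] + img[i] + img[min(i + 1, len(img) - 1)]) // 3
            for i in range(len(img))]

def threshold(img, t=128):              # stand-in binarization kernel
    return [255 if p > t else 0 for p in img]

g = Graph().add_node(gaussian_blur, "blur").add_node(threshold, "thresh")
result = g.process([0, 90, 200, 255])   # -> [0, 0, 255, 255]
```

Because the whole pipeline is known before execution, the runtime has the freedom to tile, fuse, or schedule it per platform, which is the portability argument the talk makes.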
GPGPU: What It Is and What It's For. Alexander Titov. CoreHard Spring 2019 (corehard_by)
GPGPU is the use of a graphics processor (GPU) to perform general-purpose computations normally carried out by the central processor (CPU). Thanks to the GPU's large computational resources, this approach can speed up some applications by tens of times compared with a traditional CPU. Given that GPUs are present in a great many modern devices, the approach can be a useful tool for any programmer who cares about the performance of their programs. This talk is an introduction to GPGPU technology. The presentation discusses the differences between CPUs and GPUs at the hardware level and explains how those differences led to different programming models for the two devices. It covers the classes of problems that GPGPU accelerates well, and the cases where a GPU can turn out to be slower than a CPU. The talk does not focus on any particular GPGPU API (OpenCL, CUDA, etc.) and requires no prior knowledge of GPU or CPU hardware.
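One way to see why a GPU can end up slower than a CPU on some workloads is Amdahl's law: if only a fraction of a program is parallelizable, the overall speedup is bounded no matter how fast the parallel part becomes. A small worked example (the fractions chosen here are illustrative):

```python
# Amdahl's law: overall speedup when a fraction p of the runtime is
# accelerated by a factor s, and the remaining (1 - p) stays serial.

def amdahl_speedup(p, s):
    return 1.0 / ((1.0 - p) + p / s)

# Even a 100x kernel speedup yields under 10x overall when 10% of the
# work (data transfers, serial setup) cannot be parallelized:
assert round(amdahl_speedup(0.9, 100), 2) == 9.17
```

Add the cost of copying data over PCIe on top of the serial fraction, and a small or poorly parallel task can easily run faster on the CPU, which is the caveat the talk highlights.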
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated Processing Units (AMD Developer Central)
Presentation HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated Processing Units, by Robert Engel at the AMD Developer Summit (APU13) Nov. 11-13, 2013.
For the full video of this presentation, please visit:
https://www.embedded-vision.com/platinum-members/embedded-vision-alliance/embedded-vision-training/videos/pages/may-2019-embedded-vision-summit-trevett
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Neil Trevett, President of the Khronos Group and Vice President at NVIDIA, presents the "APIs for Accelerating Vision and Inferencing: An Industry Overview of Options and Trade-offs" tutorial at the May 2019 Embedded Vision Summit.
The landscape of SDKs, APIs and file formats for accelerating inferencing and vision applications continues to evolve rapidly. Low-level compute APIs, such as OpenCL, Vulkan and CUDA are being used to accelerate inferencing engines such as OpenVX, CoreML, NNAPI and TensorRT, being fed by neural network file formats such as NNEF and ONNX.
Some of these APIs, like OpenCV, are vision-specific, while others, like OpenCL, are general-purpose. Some engines, like CoreML and TensorRT, are supplier-specific, while others such as OpenVX, are open standards that any supplier can adopt. Which ones should you use for your project? Trevett answers these and other questions in this presentation.
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su... (Intel® Software)
Software AI accelerators deliver orders-of-magnitude performance gains for AI across deep learning, classical machine learning, and graph analytics, and are key to enabling AI Everywhere. Get started on your AI Developer Journey @ software.intel.com/ai.
Time Critical Multitasking for Multicore (ijesajournal)
This paper presents research on multicore microcontrollers using parallel and time-critical programming for embedded systems. Due to their high complexity and limitations, such architectures are very hard to work with during the application development phase. The experimental results reported in the paper are based on the xCORE multicore microcontroller from XMOS®. The paper also demonstrates multitasking and parallel programming on the same platform. Tasks assigned to multiple cores are executed simultaneously, which saves time and energy. A comparative study of multicore processors and multicore controllers concludes that a microarchitecture-based controller with multiple cores delivers better performance in time-critical multitasking environments. The work presented here not only illustrates the functionality of the multicore microcontroller but also describes a novel technique for programming, profiling, and optimization on such platforms in real-time environments.
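The xCORE model above statically assigns tasks to hardware cores that run simultaneously. As a loose analogy only (Python gives no real-time guarantees, and the task names below are invented), the stdlib executor shows the same submit-tasks-and-join pattern:

```python
# Loose analogy for assigning independent tasks to workers and joining
# on their results, as a multicore scheduler assigns tasks to cores.
from concurrent.futures import ThreadPoolExecutor

def sample_sensor(sensor_id):
    # Stand-in for a task that would be pinned to one core;
    # the "reading" is fake and deterministic.
    return sensor_id, sensor_id * 2

with ThreadPoolExecutor(max_workers=4) as pool:
    # All four tasks are submitted at once and run concurrently.
    readings = dict(pool.map(sample_sensor, range(4)))
```

On the xCORE, this dispatch is done in hardware with deterministic timing, which is precisely what a general-purpose OS-level pool cannot promise.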
For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/embedded-vision-alliance/embedded-vision-training/videos/pages/sept-2014-member-meeting-linley
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Linley Gwennap, founder and principal analyst of The Linley Group, delivers the presentation "Processors for Embedded Vision: Technology and Market Trends" at the September 2014 Embedded Vision Alliance Member Meeting.
For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/amd/embedded-vision-training/videos/pages/may-2016-embedded-vision-summit
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Allen Rush, Fellow at AMD, presents the "How Computer Vision Is Accelerating the Future of Virtual Reality" tutorial at the May 2016 Embedded Vision Summit.
Virtual reality (VR) is the new focus for a wide variety of applications including entertainment, gaming, medical, science, and many others. The technology driving the VR user experience has advanced rapidly in the past few years, and it is now poised to proliferate into these applications with solid products that offer a range of cost, performance and capabilities. The next question is: how does computer vision intersect this emerging modality? Already we are seeing examples of the integration of computer vision and VR, for example for simple eye tracking and gesture recognition. This talk explores how we can expect more complex computer vision capabilities to become part of the VR landscape and the business and technical challenges that must be overcome to realize these compelling capabilities.
For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/embedded-vision-alliance/embedded-vision-training/videos/pages/may-2015-embedded-vision-summit-opencv
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Gary Bradski, President and CEO of the OpenCV Foundation, presents the "OpenCV Open Source Computer Vision Library: Latest Developments" tutorial at the May 2015 Embedded Vision Summit.
OpenCV is an enormously popular open source computer vision library, with over 9 million downloads. Originally used mainly for research and prototyping, in recent years OpenCV has increasingly been used in deployed products on a wide range of platforms from cloud to mobile.
The latest version, OpenCV 3.0 is currently in beta, and is a major overhaul, bringing OpenCV up to modern C++ standards and incorporating expanded support for 3D vision. The new release also introduces a modular “contrib” facility that enables independently developed modules to be quickly integrated with OpenCV as needed, providing a flexible mechanism to allow developers to experiment with new techniques before they are officially integrated into the library.
In this talk, Gary Bradski, head of the OpenCV Foundation, provides an insider’s perspective on the new version of OpenCV and how developers can utilize it to maximum advantage for vision research, prototyping, and product development.
Mobile computer vision requires deep SoC-based optimization and an extensive amount of development resources. This presentation reviews the challenges of mobile computer vision optimization, the vision for a cross-platform API, and the current solution of using FastCV.
Hai Tao at AI Frontiers: Deep Learning For Embedded Vision System (AI Frontiers)
This presentation will demonstrate our recent progress in developing advanced computer vision algorithms using embedded platforms for video-based face recognition, vehicle attribute analysis, urban management event detection, and high-density crowd counting. These algorithms combine the traditional CV approach with recent advances in deep learning to make high-performance computer vision systems practical and enable products in several vertical markets including intelligent transportation systems (ITS), business intelligence (BI), and smart video surveillance. We will demonstrate algorithm design and optimization scheme for several recently available processors from Movidius, Nvidia, and ARM.
Using GPUs to Handle Big Data with Java (Tim Ellison)
A copy of the slides presented at JavaOne conference 2014.
Learn how Java can exploit the power of graphics processing units (GPUs) to optimize high-performance enterprise and technical computing applications such as big data and analytics workloads. This presentation covers principles and considerations for GPU programming from Java and looks at the software stack and developer tools available. It also presents a demo showing GPU acceleration and discusses what is coming in the future.
Various virtualization technologies have been on the market for more than a decade, but they typically occupied cloud platforms. Recently, virtualization began spreading to embedded platforms after ARM introduced the Virtualization Extension for its recent processors. Various peripherals (like disks and network interfaces) have been easily virtualized for use by several operating systems at once, but components like graphics processing units (GPUs) remain among the most intricate to adapt, with very few vendors who have actually managed to do it.
Sergiy Kibrik (Software Engineer, GlobalLogic) explains how it was done at GlobalLogic. This presentation was delivered at the GlobalLogic Embedded TechTalk in Kyiv on July 22, 2015.
Checkpointing the Un-checkpointable: MANA and the Split-Process Approach (inside-BigData.com)
In this deck from the MVAPICH User Group, Gene Cooperman from Northeastern University presents: Checkpointing the Un-checkpointable: MANA and the Split-Process Approach.
"Checkpointing is the ability to save the state of a running process to stable storage, and later restarting that process from the point at which it was checkpointed. Transparent checkpointing (also known as system-level checkpointing) refers to the ability to checkpoint a (possibly MPI-parallel or distributed) application, without modifying the binaries of that target application. Traditional wisdom has assumed that the transparent checkpointing approach has some natural restrictions. Examples of long-held restrictions are: (i) the need for a separate network-aware checkpoint-restart module for each network that will be targeted (e.g., one for TCP, one for InfiniBand, one for Intel Omni-Path, etc.); (ii) the impossibility of transparently checkpointing a CUDA-based GPU application that uses NVIDIA UVM (UVM is "unified virtual memory", which allows the host CPU and the GPU device to each access the same virtual address space at the same time.); and (iii) the impossibility of transparently checkpointing an MPI application that was compiled for one MPI library implementation (e.g., for MPICH or for Open MPI), and then restarting under an MPI implementation with targeted optimizations (e.g., MVAPICH2-X or MVAPICH2-EA). This talk breaks free from the restrictions described above, and presents an efficient, new software architecture: split processes. The "MANA for MPI" software demonstrates this split-process architecture. The MPI application code resides in "upper-half memory", and the MPI/network libraries reside in "lower-half memory". The tight coupling of upper and lower half ensures low runtime overhead. And yet, when restarting from a checkpoint, "MANA for MPI" allows one to choose to replace the original lower half with a different MPI library implementation. 
This different MPI implementation may offer such specialized features as enhanced intra- and inter-node point-to-point performance and enhanced performance of collective communication (e.g., with MVAPICH2-X); or perhaps better energy awareness (e.g., with MVAPICH2-EA). Further, the new lower half MPI may be optimized to run on different hardware, including a different network interconnect, a different number of CPU cores, a different configuration of ranks-per-node, etc. This makes cross-cluster migration both efficient and practical. This talk represents joint work with Rohan Garg and Gregory Price."
Watch the video: https://wp.me/p3RLHQ-kMn
Learn more: http://mug.mvapich.cse.ohio-state.edu/program/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
ScicomP 2015 presentation discussing best practices for debugging CUDA and OpenACC applications with a case study on our collaboration with LLNL to bring debugging to the OpenPOWER stack and OMPT.
Stay up-to-date on the latest news, events and resources for the OpenACC community. This month’s highlights covers the on-demand sessions from the OpenACC Summit 2020, upcoming GPU Hackathons and Bootcamps, an OpenACC-to-FPGA framework, the NERSC GPU Hackathon, new resources and more!
Stay up-to-date on the latest news, events and resources for the OpenACC community. This month’s highlights covers working on applications for the new Frontier supercomputer, using OpenACC for weather forecasting, upcoming GPU Hackathons and Bootcamps, and new resources!
Engineering software is widely employed for its powerful abstraction of scientific and technical knowledge. It enables productive applications, e.g., analysis, prototyping, and manufacturing. Making engineering software requires a profound understanding in the problem domain, as well as the art of engineering it.
Software engineering differs substantially from conventional engineering. To professionally build software, mathematicians, scientists, and engineers need skills including system administration, automatic build, automatic testing, version control, to name but a few. Computer science knowledge like algorithms and data structures is also indispensable. It is a joyful, interdisciplinary, and world-changing enterprise worth sharing with all future engineering practitioners.
Similar to 2014/07/17 Parallelize computer vision by GPGPU computing (20)
This is an academic talk for professors and graduate students. In addition to introducing recent trends in embedded computer vision (ECV), I also present our research experience in ECV.
My slides for acamedia talk about embedded vision in 2010. Some of our research results are also presented in this presentation.
Few slides have chinese characters.
It is a presentation for acamedia talk about cloud computing for intelligent video surveillance, i.e. VSaaS, given in 2010. Some of our research results are also presented in this presentation.
It is a presentation for acamedia talk about intelligent video surveillance and video sousveillance given in 2010. Some of our research results are also presented in this presentation.
More from IEEE International Conference on Intelligent Information Hiding and Multimedia Signal Processing (16)
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdffxintegritypublishin
Advancements in technology unveil a myriad of electrical and electronic breakthroughs geared towards efficiently harnessing limited resources to meet human energy demands. The optimization of hybrid solar PV panels and pumped hydro energy supply systems plays a pivotal role in utilizing natural resources effectively. This initiative not only benefits humanity but also fosters environmental sustainability. The study investigated the design optimization of these hybrid systems, focusing on understanding solar radiation patterns, identifying geographical influences on solar radiation, formulating a mathematical model for system optimization, and determining the optimal configuration of PV panels and pumped hydro storage. Through a comparative analysis approach and eight weeks of data collection, the study addressed key research questions related to solar radiation patterns and optimal system design. The findings highlighted regions with heightened solar radiation levels, showcasing substantial potential for power generation and emphasizing the system's efficiency. Optimizing system design significantly boosted power generation, promoted renewable energy utilization, and enhanced energy storage capacity. The study underscored the benefits of optimizing hybrid solar PV panels and pumped hydro energy supply systems for sustainable energy usage. Optimizing the design of solar PV panels and pumped hydro energy supply systems as examined across diverse climatic conditions in a developing country, not only enhances power generation but also improves the integration of renewable energy sources and boosts energy storage capacities, particularly beneficial for less economically prosperous regions. Additionally, the study provides valuable insights for advancing energy research in economically viable areas. 
Recommendations included conducting site-specific assessments, utilizing advanced modeling tools, implementing regular maintenance protocols, and enhancing communication among system components.
6th International Conference on Machine Learning & Applications (CMLA 2024)ClaraZara1
6th International Conference on Machine Learning & Applications (CMLA 2024) will provide an excellent international forum for sharing knowledge and results in theory, methodology and applications of on Machine Learning & Applications.
Using recycled concrete aggregates (RCA) for pavements is crucial to achieving sustainability. Implementing RCA for new pavement can minimize carbon footprint, conserve natural resources, reduce harmful emissions, and lower life cycle costs. Compared to natural aggregate (NA), RCA pavement has fewer comprehensive studies and sustainability assessments.
The Internet of Things (IoT) is a revolutionary concept that connects everyday objects and devices to the internet, enabling them to communicate, collect, and exchange data. Imagine a world where your refrigerator notifies you when you’re running low on groceries, or streetlights adjust their brightness based on traffic patterns – that’s the power of IoT. In essence, IoT transforms ordinary objects into smart, interconnected devices, creating a network of endless possibilities.
Here is a blog on the role of electrical and electronics engineers in IOT. Let's dig in!!!!
For more such content visit: https://nttftrg.com/
HEAP SORT ILLUSTRATED WITH HEAPIFY, BUILD HEAP FOR DYNAMIC ARRAYS.
Heap sort is a comparison-based sorting technique based on Binary Heap data structure. It is similar to the selection sort where we first find the minimum element and place the minimum element at the beginning. Repeat the same process for the remaining elements.
Cosmetic shop management system project report.pdfKamal Acharya
Buying new cosmetic products is difficult. It can even be scary for those who have sensitive skin and are prone to skin trouble. The information needed to alleviate this problem is on the back of each product, but it's thought to interpret those ingredient lists unless you have a background in chemistry.
Instead of buying and hoping for the best, we can use data science to help us predict which products may be good fits for us. It includes various function programs to do the above mentioned tasks.
Data file handling has been effectively used in the program.
The automated cosmetic shop management system should deal with the automation of general workflow and administration process of the shop. The main processes of the system focus on customer's request where the system is able to search the most appropriate products and deliver it to the customers. It should help the employees to quickly identify the list of cosmetic product that have reached the minimum quantity and also keep a track of expired date for each cosmetic product. It should help the employees to find the rack number in which the product is placed.It is also Faster and more efficient way.
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...Amil Baba Dawood bangali
Contact with Dawood Bhai Just call on +92322-6382012 and we'll help you. We'll solve all your problems within 12 to 24 hours and with 101% guarantee and with astrology systematic. If you want to take any personal or professional advice then also you can call us on +92322-6382012 , ONLINE LOVE PROBLEM & Other all types of Daily Life Problem's.Then CALL or WHATSAPP us on +92322-6382012 and Get all these problems solutions here by Amil Baba DAWOOD BANGALI
#vashikaranspecialist #astrologer #palmistry #amliyaat #taweez #manpasandshadi #horoscope #spiritual #lovelife #lovespell #marriagespell#aamilbabainpakistan #amilbabainkarachi #powerfullblackmagicspell #kalajadumantarspecialist #realamilbaba #AmilbabainPakistan #astrologerincanada #astrologerindubai #lovespellsmaster #kalajaduspecialist #lovespellsthatwork #aamilbabainlahore#blackmagicformarriage #aamilbaba #kalajadu #kalailam #taweez #wazifaexpert #jadumantar #vashikaranspecialist #astrologer #palmistry #amliyaat #taweez #manpasandshadi #horoscope #spiritual #lovelife #lovespell #marriagespell#aamilbabainpakistan #amilbabainkarachi #powerfullblackmagicspell #kalajadumantarspecialist #realamilbaba #AmilbabainPakistan #astrologerincanada #astrologerindubai #lovespellsmaster #kalajaduspecialist #lovespellsthatwork #aamilbabainlahore #blackmagicforlove #blackmagicformarriage #aamilbaba #kalajadu #kalailam #taweez #wazifaexpert #jadumantar #vashikaranspecialist #astrologer #palmistry #amliyaat #taweez #manpasandshadi #horoscope #spiritual #lovelife #lovespell #marriagespell#aamilbabainpakistan #amilbabainkarachi #powerfullblackmagicspell #kalajadumantarspecialist #realamilbaba #AmilbabainPakistan #astrologerincanada #astrologerindubai #lovespellsmaster #kalajaduspecialist #lovespellsthatwork #aamilbabainlahore #Amilbabainuk #amilbabainspain #amilbabaindubai #Amilbabainnorway #amilbabainkrachi #amilbabainlahore #amilbabaingujranwalan #amilbabainislamabad
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...ssuser7dcef0
Power plants release a large amount of water vapor into the
atmosphere through the stack. The flue gas can be a potential
source for obtaining much needed cooling water for a power
plant. If a power plant could recover and reuse a portion of this
moisture, it could reduce its total cooling water intake
requirement. One of the most practical way to recover water
from flue gas is to use a condensing heat exchanger. The power
plant could also recover latent heat due to condensation as well
as sensible heat due to lowering the flue gas exit temperature.
Additionally, harmful acids released from the stack can be
reduced in a condensing heat exchanger by acid condensation. reduced in a condensing heat exchanger by acid condensation.
Condensation of vapors in flue gas is a complicated
phenomenon since heat and mass transfer of water vapor and
various acids simultaneously occur in the presence of noncondensable
gases such as nitrogen and oxygen. Design of a
condenser depends on the knowledge and understanding of the
heat and mass transfer processes. A computer program for
numerical simulations of water (H2O) and sulfuric acid (H2SO4)
condensation in a flue gas condensing heat exchanger was
developed using MATLAB. Governing equations based on
mass and energy balances for the system were derived to
predict variables such as flue gas exit temperature, cooling
water outlet temperature, mole fraction and condensation rates
of water and sulfuric acid vapors. The equations were solved
using an iterative solution technique with calculations of heat
and mass transfer coefficients and physical properties.
Welcome to WIPAC Monthly the magazine brought to you by the LinkedIn Group Water Industry Process Automation & Control.
In this month's edition, along with this month's industry news to celebrate the 13 years since the group was created we have articles including
A case study of the used of Advanced Process Control at the Wastewater Treatment works at Lleida in Spain
A look back on an article on smart wastewater networks in order to see how the industry has measured up in the interim around the adoption of Digital Transformation in the Water Industry.
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Dr.Costas Sachpazis
Terzaghi's soil bearing capacity theory, developed by Karl Terzaghi, is a fundamental principle in geotechnical engineering used to determine the bearing capacity of shallow foundations. This theory provides a method to calculate the ultimate bearing capacity of soil, which is the maximum load per unit area that the soil can support without undergoing shear failure. The Calculation HTML Code included.
Technical Drawings introduction to drawing of prisms
2014/07/17 Parallelize computer vision by GPGPU computing
1. Parallelize Computer Vision by GPGPU Computing
Wang, Yuan-Kai (王元凱)
Electrical Engineering Department, Fu Jen Catholic University (輔仁大學電機工程系)
ykwang@mail.fju.edu.tw
http://www.ykwang.tw
2014/07/17
2. About this Course
❖ Multicore Era for Computer Vision
❖ GPGPU
❖ Parallel Programming (CUDA, OpenCL, Renderscript)
❖ OpenCV Acceleration with GPGPU
❖ Computer Vision Acceleration
3. Multicore Era for Computer Vision
Paradigm shift from the clock-speed race to the multicore race
4. Multicore Computing
❖ What is multicore?
• Combining multiple processors (CPU, DSP, GPGPU, FPGA) into a single chip
❖ Multicore computing is inevitable
5. Moore's Law
❖ In 1965, Gordon Moore (Intel co-founder) predicted
• The number of transistors on an IC would double every 18 months
❖ The well-known corollary
• The performance of computers doubles every 18 months
• More transistors → more performance
❖ The prediction held true for Intel's CPUs for 40 years
6. Review of Moore's Law
❖ Transistors per chip did increase
Software enjoys the fruits of hardware's labour.
7. Problems
❖ More transistors need high frequency
• We entered the clock-speed race
❖ But high frequency needs high power consumption
• High power consumption → heat problem
• Clock speeds stalled at around 4 GHz
8. Paradigm Shift from 2000 AD
❖ General-purpose multicore comes of age
❖ Chip companies race to create multicore processors
• CPU: Intel Core Duo, quad-core, ARM v7, ...
• DSP: TI OMAP, ARM NEON, ...
• GPU/GPGPU:
• nVidia: GeForce/Tesla, Tegra
• ARM: Mali-T6x
• ...
9. The Multicore Evolution
• Pentium processor: optimized for a single thread
• In 5~10 years: 10~100 energy-efficient cores optimized for parallel execution (e.g., Core Duo and beyond)
From a large mono-core to multiple lightweight cores
10. Moore's Law Needs Multicore
❖ A single core can no longer track Moore's law
❖ Multicore can track Moore's law if a parallel programming model exists
[Chart: performance vs. time; the single-core curve flattens while the multicore curve keeps rising]
11. Two Architectures for Multicore
❖ Symmetric multiprocessing (SMP)
• Multicore CPU, GPGPU, DSP multicore
• Homogeneous computing
❖ Asymmetric multiprocessing (AMP)
• CPU+GPGPU, CPU+FPGA, CPU+DSP
• Heterogeneous computing
12. Multicore CPU (1/2)
❖ Two or more CPU cores in a chip
❖ Example: Intel Core i7 (multiple execution cores)
14. GPGPU (1/2)
❖ GPU (Graphics Processing Unit)
• The processor on a graphics card that speeds up 3D graphics
• Game playing is a major application
❖ GPGPU: General-Purpose GPU
• General-purpose computation using a GPU in applications other than 3D graphics
15. GPGPU (2/2)
❖ A GPGPU has more cores than a CPU
• 120~3072 cores vs. 2~8 cores (many-core vs. multi-core)
❖ A GPGPU is more powerful than a multicore CPU
❖ Vendors:
• nVidia
• Qualcomm (Adreno, formerly AMD/ATI)
• ARM
• Intel
16. It is the Software, Stupid
❖ Gary Smith and Daya Nadamuni, Gartner Dataquest, Design Automation Conf., 2006:
❖ "The biggest problem with SoC design is embedded software development."
❖ "The next big hurdle is programmability. It's the ability to program these multicore platforms."
❖ "You can have elegant algorithms, first-pass silicon, and fancy intellectual property. But without software, the product goes nowhere."
20. A Complete Vision System – Video Surveillance as an Example
[Pipeline: Video Capture → Image Enhancement → Object/Event Detection → Object Tracking → Object/Event Recognition → Behavior Analysis / Retrieval]
[Examples shown: imaging, image/video enhancement, tripwire event detection, abnormal-behavior detection, face recognition, retrieval]
21. Computer Vision Needs High-Performance Computing
❖ A CV example: video processing
• e.g., intelligent video surveillance
❖ Its complexity is high
• Video (1080p RGB): ~6 M values per frame (1920×1080×3 channels), 30 fps
• 100~1K flops per value
• ⇒ 18~180 Gflops per second
❖ Massive data processing, intensive computation
23. However
❖ Can CV algorithms speed up every 18 months with multicore?
❖ Multicore is not a simple drop-in upgrade for CV algorithm performance
• The transition from single core to multicore is blocked by software
• We are not ready to face the software programming challenges
• It is the software, stupid.
25. Multi-threading Demands New Programming Skills
❖ Established multi-threading techniques
• Windows threads, pthreads, OpenMP, MPI, ...
❖ New techniques
• CUDA, C++ AMP, OpenCL, Renderscript, OpenACC, MapReduce, ...
❖ Concepts
• Race conditions, deadlock
• Domain partition, function partition, ...
26. Multicore Programming Practice (MPP)
❖ Goal: write portable C/C++ programs that are "multicore ready" and platform compatible
• Proposed by the MPP working group in the Multicore Association
http://www.multicore-association.org/workgroup/mpp.php
27. OpenACC
❖ A standards organization that develops an API
• It describes a collection of compiler directives
• to specify loops and regions of code in standard C, C++ and Fortran
• to be offloaded from a host CPU to an attached accelerator, including APUs, GPUs, and many-core coprocessors
28. HSA Foundation
❖ Heterogeneous System Architecture
• Key members: AMD, Qualcomm, ARM, Samsung, TI
❖ A system architecture easing efficient use of accelerators and SoCs
• Intended to support high-level parallel programming frameworks
• OpenCL, C++, C#, OpenMP, Java
• Accelerator requirements: full-system SVM, memory coherency, preemption, user-mode dispatch
29. The ParLab in Berkeley
❖ The Parallel Computing Lab at UC Berkeley (http://parlab.eecs.berkeley.edu)
• The ParLab offers programmers a practical introduction to parallel programming techniques and tools on current parallel computers, emphasizing multicore and manycore computers.
31. OpenCL
❖ A royalty-free, cross-platform, cross-vendor standard
• Targeting: supercomputers → embedded systems → mobile devices
❖ Enables programming of diverse compute resources
• CPU, GPU, DSP, FPGA, ...
32. OpenCL Working Group Members
❖ Diverse industry participation with many industry experts
❖ NVIDIA is chair; Apple is specification editor
33. Today We Talk About
❖ Why GPGPU's multicore is better (Sec. 2)
• Vendors, hardware
❖ How to program in parallel (Sec. 3)
❖ OpenCV acceleration (Sec. 4)
❖ Computer vision acceleration on PC (Sec. 5)
❖ Computer vision acceleration on Android (Sec. 6)
37. PC Platform
• Discrete GPUs
• GPGPU card as a coprocessor (attached via PCIe)
From PC to PSC (Personal Super-Computer)
38. Mobile Platform
• Integrated GPUs
• GPGPU sub-chip as a coprocessor (CPU and GPGPU on the same SoC, no PCIe)
From mobile phone to mobile personal computer
39. GPGPU Solutions - nVidia
• Compute architectures: Tesla, Fermi, Kepler, ...
• PC
• GeForce, Quadro
• Tesla: 870, 1060, 2070, K40
• Mobile
• Tegra: ..., 4, K1 (192 cores)
It's Tegra K1 Everywhere at Google I/O, Embedded Vision Alliance, 2014/7/7.
40. GPGPU Solutions - Qualcomm/AMD
❖ Qualcomm, AMD, ATI
❖ APU: integrated CPU+GPU
❖ Low energy consumption
❖ PC (AMD): FirePro
❖ Mobile (Snapdragon): Adreno 330 (32 cores)
41. GPGPU Solutions - ARM
❖ Mali
❖ Used in Samsung Exynos and MediaTek SoCs
❖ Compute engine since the Mali-T600 series
❖ Exynos 5
❖ At most 8 cores (Mali-T678)
42. Intel - Multicore CPU
• PC (Xeon Phi)
• Iris Pro GPU
• Knights Landing: ~60 cores
• Knights Corner: up to 61 cores, over PCIe
• Mobile
• Haswell
• Atom
44. Heterogeneous Architecture
❖ Host: CPU
❖ Device: GPGPU
❖ Note the memory hierarchy in the device
45. GPGPU Architectures - nVidia
❖ GT200
• GTX 260/280, Quadro FX 5800, Tesla 1060
❖ Fermi
• Tesla 2060
[Diagram: the CPU (host, multicore) devotes much of the die to control logic and cache; the GPU (device, many-core) devotes most of the die to ALUs; each has its own DRAM]
46. nVidia GPGPU Architecture
❖ SM/SP (streaming multiprocessor / streaming processor) + shared memory + DRAM
47. Memory Hierarchy
❖ On-Chip Memory
• Registers
• Shared Memory
• Constant Memory
• Texture Memory
❖ Off-Chip Memory
• Local Memory
• Global Memory
48. GPGPU vs. FPGA
❖ GPU: nVidia GeForce GTX 280, GTX 580
❖ FPGA: Xilinx Virtex-4, Virtex-5
A Comparison of FPGA and GPU for Real-Time Phase-Based Optical Flow, Stereo, and Local Image Features, IEEE Transactions on Computers, 2012.
49. GPGPU vs. FPGA
❖ GPU: nVidia GeForce 7900 GTX
❖ FPGA: Xilinx Virtex-4
Performance Comparison of Graphics Processors to Reconfigurable Logic: A Case Study, IEEE Transactions on Computers, 2010.
50. GPGPU vs. FPGA vs. Multicore
❖ Application: 2-D image convolution
• GPU: nVidia GeForce 295 GTX
• FPGA: Altera Stratix III E260
A Performance and Energy Comparison of FPGAs, GPUs, and Multicores for Sliding-Window Applications, ACM/SIGDA International Symposium on FPGA, 2012.
52. Hardware vs. Software
• GPGPU vendors: nVidia, Qualcomm, ARM, Intel
• Parallel programming: CUDA, OpenCL, Renderscript, C++ AMP
53. Today We Talk About
❖ Why GPGPU's multicore is better (Sec. 2)
❖ How to program in parallel (Sec. 3)
• CUDA, Renderscript, OpenCL, ...
❖ OpenCV acceleration (Sec. 4)
❖ Computer vision acceleration on PC (Sec. 5)
❖ Computer vision acceleration on Android (Sec. 6)
55. Parallel Computing
❖ Serial computing: one instruction stream on a single core
❖ Parallel computing: work split across multiple cores (CPU/GPU)
56. Parallel Programming
❖ Many codes are written in C/C++/Java
• Especially algorithmic programs
❖ Can we write GPGPU parallel programs in C/C++/Java?
❖ However, C/C++ is sequential
• The three control structures of C/C++/Java: sequence, selection, repetition
57. Multi-threading
❖ Multi-threading is the fundamental concept of parallel programming
• Established techniques: pthreads, Win32 threads, OpenMP, MPI, Intel TBB (Threading Building Blocks), ...
• New techniques: CUDA, OpenCL, Renderscript, OpenACC, C++ AMP, ...
59. Parallel Programming in a Sequential Language
❖ Do we need to learn new languages for multi-threading? No.
❖ Write multi-threading code in C/C++
• Add functions/directives to C/C++ for multi-threading
• That is what current solutions do: pthreads, Win32 threads, OpenMP, MPI, CUDA, OpenCL, ...
60. Decompose the Problem
❖ Two basic approaches to partitioning computational work
• Domain decomposition: partition the data used in solving the problem
• Function decomposition: partition the jobs (functions) that make up the overall work; e.g., CPU and GPGPU cooperate
61. Multi-Threading
❖ A program running in serial vs. in parallel
http://en.wikipedia.org/wiki/Thread_(computer_science)
62. Domain Decomposition (1/3)
❖ An image example
• An image is 2D data
• Three popular ways to partition it
63. Domain Decomposition (2/3)
❖ Domain data are usually processed by loops:
  for (i = 0; i < height; i++)
    for (j = 0; j < width; j++)
      img2[i][j] = RemoveNoise(img1[i][j]);
[Figure: X-ray image of a circuit board; original image (img1) and enhanced image (img2)]
Related execution models: SIMD, SPMD, SIMT
64. Domain Decomposition (3/3)
❖ A three-block partition example (fork threads, then join at a barrier):
  // Thread 1: rows [0, height/3)
  for (i = 0; i < height/3; i++)
    for (j = 0; j < width; j++)
      img2[i][j] = RemoveNoise(img1[i][j]);
  // Thread 2: rows [height/3, 2*height/3)
  for (i = height/3; i < height*2/3; i++)
    for (j = 0; j < width; j++)
      img2[i][j] = RemoveNoise(img1[i][j]);
  // Thread 3: rows [2*height/3, height)
  for (i = height*2/3; i < height; i++)
    for (j = 0; j < width; j++)
      img2[i][j] = RemoveNoise(img1[i][j]);
The same fork/join pattern underlies OpenMP parallel loops and CUDA's SPMD model (e.g., rows i = 0..11 split into subdomains 1, 2, 3).
65. GPGPU Programming: the SIMT Model
❖ The CPU ("host") program is often written in C or C++
❖ GPU code is written as a sequential kernel in (usually) a C or C++ dialect
69. CUDA
❖ CUDA: Compute Unified Device Architecture
❖ Parallel programming for nVidia's GPGPUs
❖ Uses the C/C++ language
• Java, Fortran, and Matlab are also supported
❖ When executing CUDA programs, the GPU operates as a coprocessor to the main CPU
70. CUDA Hardware Environment: CPU+GPU
❖ CPU
• Organizes, interprets, and communicates information
❖ GPU
• Handles the core processing on large quantities of parallel information
• Compute-intensive portions of applications that are executed many times, but on different data, are extracted from the main application and compiled to execute in parallel on the GPU
(CPU ↔ GPU over PCI-E)
72. Processing Flow on CUDA
1. Allocate device memory
2. Copy the processing data from main memory to GPU memory
3. Instruct the processing (launch the kernel)
4. Execute in parallel on each GPU core
5. Copy the result back to main memory
6. Release device memory
73. Programming with the Memory Hierarchy
❖ The locality principle
• Temporal locality
• Spatial locality
74. Example - Hello World (1/3)
  int main()
  {
      char src[12] = "Hello World";
      char h_hello[12];
      char* d_hello1;
      char* d_hello2;
      cudaMalloc((void**)&d_hello1, sizeof(char) * 12);
      cudaMalloc((void**)&d_hello2, sizeof(char) * 12);
      cudaMemcpy(d_hello1, src, sizeof(char) * 12,
                 cudaMemcpyHostToDevice);
      hello<<<1,1>>>(d_hello1, d_hello2);  // call the kernel function
(Host side: src, h_hello; device side: d_hello1, d_hello2)
75. Example - Hello World (2/3)
❖ Kernel function
  __global__ void hello(char* hello1, char* hello2)
  {
      int k;
      for (k = 0; hello1[k] != '\0'; k++) {  // copy until the terminator
          hello2[k] = hello1[k];
      }
  }
No parallel processing in this example.
79. What's OpenCL
❖ One code tree can be executed on CPUs, GPUs, DSPs and hardware
• Dynamically interrogate system load and balance work across available processors
❖ Powerful, low-level flexibility
• Foundational access to compute resources for higher-level engines, frameworks and languages
80. Broad OpenCL Implementer Adoption
❖ Multiple conformant implementations shipping on desktop and mobile
❖ Android ICD extension released in the latest extension specification
❖ Multiple implementations shipping in the Android NDK
84. AMD OpenCL Optimization Case Study
❖ Platform
• AMD Phenom II X4 965 CPU (quad core)
• ATI Radeon HD 5870 GPU
❖ Unoptimized CPU performance: 1 GFLOP/s
❖ Optimized CPU performance: 4 GFLOP/s
❖ Optimized GPU performance: 50 GFLOP/s
90. What's C++ AMP (1/2)
❖ Microsoft's C++ AMP (Accelerated Massive Parallelism)
• Part of Visual C++, integrated with Visual Studio, built on Direct3D
• "Performance for the mainstream"
❖ An STL-like library for multidimensional array data
• Special convenience support for 1-, 2-, and 3-dimensional arrays on CPU or GPU
• The C++ AMP runtime handles CPU↔GPU data copying
• Tiles enable efficient processing of sub-arrays
91. What's C++ AMP (2/2)
❖ parallel_for_each
• Executes a kernel (a C++ lambda) at each point in the extent
• The restrict() clause specifies where to run the kernel: cpu (the default) or direct3d (the GPU)
95. What's Renderscript (1/2)
❖ Higher-level than CUDA or OpenCL: simpler, with less performance control
• Emphasis on mobile devices and cross-SoC performance portability
❖ Programming model
• C99-based kernel language, JIT-compiled, single input-single output
• Automatic Java class reflection
• Intrinsics: built-in, highly tuned operations, e.g. ScriptIntrinsicConvolve3x3
• Script groups combine kernels to amortize launch cost and enable kernel fusion
96. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p.
What's Renderscript(2/2)
❖ Data type:
• 1D/2D collections of elements, C types like int
and short2, types include size
• Runtime type checking
❖ Parallelism
• Implicit: one thread per data element,
atomics for thread-safe access
• Thread scheduling not exposed, VM-decided
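Renderscript's implicit parallelism (one thread per data element, with scheduling decided by the runtime) can be mimicked in plain Python. This is only an illustrative analogy, not Renderscript itself; the `kernel` function and data are made up:

```python
from concurrent.futures import ThreadPoolExecutor

def kernel(x):
    # per-element "kernel": invoked once per data element,
    # mirroring Renderscript's one-thread-per-element model
    return x * x

data = [0, 1, 2, 3, 4, 5, 6, 7]
with ThreadPoolExecutor() as pool:
    # the pool decides the scheduling, as the Renderscript VM does
    out = list(pool.map(kernel, data))
```

Because each element is processed independently, no atomics are needed here; atomics only matter when kernels write to shared locations.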
Comparison (1/2)
❖Renderscript vs. Native (NDK) vs. Java (SDK)
• OS: Android Honeycomb v3.2 (CPU only)
Qian, Xi, Guangyu Zhu, and Xiao-Feng Li. "Comparison and analysis of the three programming models in google android." in Proc. First Asia-Pacific Programming Languages and Compilers Workshop (APPLC), 2012.
Comparison(2/2)
❖OpenCL & CUDA
• Sobel filter with (CMw) and without (CMw/o)
constant memory
OpenCL’s portability does not
fundamentally affect its performance
Fang, Jianbin, Ana Lucia Varbanescu, and Henk Sips. "A
comprehensive performance comparison of CUDA and OpenCL." in
Proc. International Conference Parallel Processing (ICPP), 2011.
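For reference, the horizontal-gradient pass of the benchmarked Sobel kernel can be written as a plain-Python CPU baseline. This is only a sketch of the operation being accelerated; the CUDA/OpenCL versions assign one thread per pixel and may cache the 3x3 kernel in constant memory:

```python
def sobel_x(img):
    # horizontal-gradient Sobel pass over a 2-D list of gray values;
    # border pixels are left at 0 for simplicity
    kx = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = sum(kx[j][i] * img[y - 1 + j][x - 1 + i]
                            for j in range(3) for i in range(3))
    return out
```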
GPGPU Programming
Performance: more control, better performance
Productivity: ease of use, quick programming, portability
Parallelization
❖ Multicore/Multi-threading
❖ Data Parallelization
• Data distribution
• Parallel convolution
• Reduction algorithm
• Amdahl's law
❖ Memory Hierarchy Management
• Locality principle: a program accesses a relatively small portion of the address space at any instant of time
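Amdahl's law in the list above caps the achievable speedup by the serial fraction of the program. A minimal sketch, where the 95% parallel fraction and 240 processors are illustrative numbers, not figures from the slides:

```python
def amdahl_speedup(p, n):
    # Amdahl's law: overall speedup when a fraction p of the runtime
    # is parallelized across n processors; the (1 - p) serial part
    # bounds the gain no matter how large n grows
    return 1.0 / ((1.0 - p) + p / n)

print(amdahl_speedup(0.95, 240))  # serial 5% keeps the speedup below 20x
```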
Multi-thread Programming with
the Discipline of Parallelization
❖ Identify parallelism: Analyze algorithm
❖ Express parallelism: Write parallel code
❖ Validate parallelism: Debug & verify parallel code
❖ Optimize parallelism: Enhance parallel performance
Today We Talk About
❖ Why GPGPU's multicore is better (Sec. 2)
❖ How to program in parallel (Sec. 3)
❖ OpenCV acceleration (Sec. 4)
❖ Computer vision acceleration on PC (Sec. 5)
❖ Computer vision acceleration on Android (Sec. 6)
OpenCV GPU Module
❖Implemented using NVIDIA CUDA
Runtime API
❖Latest version: 2.4.9
• Utilizing Multiple GPUs
❖Implemented modules: 11
❖Implemented functions: 270
Focused on the PC platform
Not fully compatible with mobile GPGPU on Android
CUDA De-noising
❖Gaussian noise removal
• gpu::FastNonLocalMeansDenoising()
❖Edge preserving smoothing
• gpu::bilateralFilter()
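gpu::bilateralFilter() accelerates the classic edge-preserving smoother. The 1-D plain-Python reference below shows the two weights it combines, spatial distance and intensity difference; this is a sketch of the algorithm, not OpenCV's implementation, and the sigma values are arbitrary:

```python
import math

def bilateral_1d(signal, sigma_s=1.0, sigma_r=30.0, radius=2):
    # each output sample is a normalized average weighted by both
    # spatial closeness and intensity similarity, so strong edges
    # contribute little across the discontinuity and are preserved
    out = []
    for i, center in enumerate(signal):
        wsum = vsum = 0.0
        for d in range(-radius, radius + 1):
            j = min(max(i + d, 0), len(signal) - 1)  # clamp at borders
            w = (math.exp(-d * d / (2 * sigma_s ** 2)) *
                 math.exp(-(signal[j] - center) ** 2 / (2 * sigma_r ** 2)))
            wsum += w
            vsum += w * signal[j]
        out.append(vsum / wsum)
    return out
```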
CUDA Fourier and MeanShift
❖Fourier analysis
•gpu::dft(), ::convolve(),
::mulAndScaleSpectrums(), etc.
❖MeanShift
•gpu::meanShiftFiltering(),
::meanShiftSegmentation()
CUDA Shape Detection
❖Line detection (e.g., lane detection, building
detection, perspective correction)
• gpu::HoughLines(), ::HoughLinesDownload()
❖Circle detection (e.g., cells, coins, balls)
• gpu::HoughCircles(),
::HoughCirclesDownload()
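The Hough transform behind gpu::HoughLines() is a voting scheme: every edge point votes for all (theta, rho) parameter pairs of lines passing through it, and peaks in the accumulator are the detected lines. A coarse plain-Python sketch with a hypothetical 4-angle quantization:

```python
import math

def hough_lines(points, thetas_deg=(0, 45, 90, 135)):
    # accumulate votes: each point votes for every quantized line
    # rho = x*cos(theta) + y*sin(theta) that passes through it
    acc = {}
    for (x, y) in points:
        for t in thetas_deg:
            th = math.radians(t)
            rho = round(x * math.cos(th) + y * math.sin(th))
            acc[(t, rho)] = acc.get((t, rho), 0) + 1
    # return the strongest line as (theta_deg, rho)
    return max(acc, key=acc.get)
```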
CUDA Object Detection
❖HAAR and LBP cascaded adaptive boosting
(e.g., face, nose, eyes, mouth)
• gpu::CascadeClassifier_GPU::detectMulti
Scale()
❖HOG detector (e.g., person, car, fruit, hand)
• gpu::HOGDescriptor::detectMultiScale()
CUDA Object Recognition
❖Interest point detectors
• gpu::cornerHarris(), ::cornerMinEigenVal(),
::SURF_GPU, ::FAST_GPU, ::ORB_GPU(),
::GoodFeaturesToTrackDetector_GPU()
❖Feature matching
• gpu::BruteForceMatcher_GPU(),
::BFMatcher_GPU()
CUDA Stereo and 3D
❖RANSAC
• gpu::solvePnPRansac()
❖Stereo correspondence (disparity map)
• gpu::StereoBM_GPU(),
::StereoBeliefPropagation(),
::StereoConstantSpaceBP(),
::DisparityBilateralFilter()
❖Represent stereo disparity as 3D or 2D
• gpu::reprojectImageTo3D(),
::drawColorDisp()
CUDA Optical Flow
❖Dense/sparse optical flow
gpu::FastOpticalFlowBM(),
::PyrLKOpticalFlow, ::BroxOpticalFlow(),
::FarnebackOpticalFlow(),
::OpticalFlowDual_TVL1_GPU(),
::interpolateFrames()
CUDA Background
Segmentation
❖Foreground/background segmentation (e.g.,
object detection/removal, motion tracking,
background removal)
• gpu::FGDStatModel, ::GMG_GPU,
::MOG_GPU, ::MOG2_GPU
5. Computer Vision
Acceleration on PC
Image enhancement (HDR)
Feature extraction
Video surveillance cloud
HDR Image Enhancement
❖ Restore and enhance an image
❖ Its complexity is high for large images: O(N^2) ~ O(N^3)
[Figure: original vs. restored image]
Algorithms for
Image Restoration
❖ Wiener Filter
❖ Histogram Based Approach
• Histogram Equalization,
Histogram Modification, …
❖ Retinex
• Path-based Retinex
• Recursive Retinex
• Center/surround Retinex: no iterative process, suitable for parallelization
• Multi-Scale Retinex with Color Restoration (MSRCR)
[Rahman et al. 1997]
MSRCR Algorithm

R_{MSRCR,i}(x,y) = C_i(x,y) \sum_{k=1}^{K} W_k \{ \log I_i(x,y) - \log [F_k(x,y) * I_i(x,y)] \}

C_i(x,y) = \beta \log \left[ \alpha I_i(x,y) \Big/ \sum_{j=1}^{N} I_j(x,y) \right]

• R_{MSRCR,i}: the MSRCR output in the ith spectral band
• I_i: the original image distribution in the ith spectral band
• F_k: the kth Gaussian Surround function
• *: the convolution operation
• W_k: the weight of the kth surround scale
• C_i: the color restoration factor in the ith spectral band
• N: the number of spectral bands
• \beta: the gain constant
• \alpha: controls the strength of the nonlinearity
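At a single surround scale, the center/surround Retinex output is log I(x) minus the log of the Gaussian-blurred image F * I. A 1-D plain-Python sketch (the weights W_k and the color restoration factor C_i are omitted; the kernel parameters are arbitrary):

```python
import math

def gaussian_kernel(sigma, radius):
    # normalized 1-D Gaussian surround function F
    k = [math.exp(-d * d / (2 * sigma ** 2)) for d in range(-radius, radius + 1)]
    s = sum(k)
    return [v / s for v in k]

def single_scale_retinex(signal, sigma=2.0, radius=4):
    # R(x) = log I(x) - log (F * I)(x): one surround scale of MSRCR
    kern = gaussian_kernel(sigma, radius)
    out = []
    for i in range(len(signal)):
        surround = sum(kern[d + radius] *
                       signal[min(max(i + d, 0), len(signal) - 1)]
                       for d in range(-radius, radius + 1))
        out.append(math.log(signal[i]) - math.log(surround))
    return out
```

On a uniform image the surround equals the pixel value, so the output is zero everywhere; only local contrast survives the subtraction.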
The Method
[Pipeline: copy data from CPU to GPGPU; on the GPGPU run Gaussian blur, log-domain processing, normalization, and histogram stretching; copy data from GPGPU back to CPU]
• Wang, Yuan-Kai, and Wen-Bin Huang. "Acceleration of an improved Retinex algorithm."
Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer
Society Conference on. IEEE, 2011.
• Wang, Yuan-Kai, and Wen-Bin Huang. "A CUDA-enabled parallel algorithm for
accelerating retinex." Journal of Real-Time Image Processing (2012): 1-19.
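The final histogram-stretching stage can be sketched as a linear contrast stretch. This is a generic formulation, not necessarily the exact stretch used in the cited papers:

```python
def histogram_stretch(values, lo=0, hi=255):
    # linear contrast stretch: map the data range [min, max]
    # onto the display range [lo, hi]
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        return [lo] * len(values)  # flat input: nothing to stretch
    scale = (hi - lo) / (vmax - vmin)
    return [lo + (v - vmin) * scale for v in values]
```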
Parallelization by GPGPU
❖ Multicore/Multi-threading
• Tesla C1060: 240 SPs (Stream Processors)
• CUDA: Thread, Block, Grid
❖ Data Parallelization
• Parallel convolution: the image is partitioned into M-pixel blocks, one per processing element (PE), with 1-pixel borders exchanged between neighboring PEs
• Parallel reduction: A(0)..A(7) are summed pairwise (A(0)+A(1), A(2)+A(3), A(4)+A(5), A(6)+A(7)), then the partial sums are combined, reaching the total sum in log2(8) = 3 time steps
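The pairwise reduction illustrated on this slide can be sketched as follows; each loop iteration corresponds to one parallel time step, and the input length is assumed to be a power of two:

```python
def tree_reduce(a):
    # pairwise tree reduction: every step sums adjacent pairs and
    # halves the active array, so 8 elements finish in log2(8) = 3 steps
    a = list(a)
    while len(a) > 1:
        a = [a[2 * i] + a[2 * i + 1] for i in range(len(a) // 2)]
    return a[0]
```

On a GPU each pair-sum in a step runs on its own processing element; here the steps are sequential, but the data flow is identical.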
OpenCV4Android SDK
❖Enables development of Android applications with the OpenCV library
❖Uses the Java Native Interface (JNI) to directly access C code
❖Supports NVIDIA's Tegra Android Development Pack (TADP)
Not fully compatible with the GPU module
RenderScript Image Processing Intrinsics
Name | Operation
ScriptIntrinsicConvolve3x3, ScriptIntrinsicConvolve5x5 | Performs a 3x3 or 5x5 convolution.
ScriptIntrinsicBlur | Performs a Gaussian blur. Supports grayscale and RGBA buffers; used by the system framework for drop shadows.
ScriptIntrinsicYuvToRGB | Converts a YUV buffer to RGB. Often used to process camera data.
ScriptIntrinsicColorMatrix | Applies a 4x4 color matrix to a buffer.
ScriptIntrinsicBlend | Blends two allocations in a variety of ways.
ScriptIntrinsicLUT | Applies a per-channel lookup table to a buffer.
ScriptIntrinsic3DLUT | Applies a color cube with interpolation to a buffer.
ScriptIntrinsicHistogram | Intrinsic histogram filter.
Gaussian Blur Example
by RenderScript Intrinsic
RenderScript rs = RenderScript.create(theActivity);
// Blur intrinsic for 4-channel (RGBA) byte data
ScriptIntrinsicBlur theIntrinsic =
    ScriptIntrinsicBlur.create(rs, Element.U8_4(rs));
Allocation tmpIn = Allocation.createFromBitmap(rs, inputBitmap);
Allocation tmpOut = Allocation.createFromBitmap(rs, outputBitmap);
theIntrinsic.setRadius(25.f);
theIntrinsic.setInput(tmpIn);
theIntrinsic.forEach(tmpOut);
tmpOut.copyTo(outputBitmap);
Performance of
RenderScript Intrinsics
❖On the new Nexus 7
❖Speedups relative to equivalent multithreaded C implementations
RenderScript Image
Processing Benchmarks(1/2)
❖CPU only on a Galaxy Nexus device.
Acceleration of Retinex Using
RenderScript
❖This paper presents rsRetinex, an optimized Retinex implementation using the Renderscript technique.
❖The experimental results show that rsRetinex gains up to a five-times speedup across different image resolutions.
Le, Duc Phuoc Phat Dat, et al. "Acceleration of Retinex Algorithm for Image
Processing on Android Device Using Renderscript." in Proc. The 8th International
Conference on Robotic, Vision, Signal Processing & Power Applications, 2014.
Mobile GPGPU List
GPU | Adoption | OpenCL / CUDA | OpenCV | Renderscript
Qualcomm Adreno | Google Nexus 10, Google new Nexus 7, SONY Xperia Tablet Z2 | OpenCL 1.2 (Adreno 302~420) | OCL module | Android 4.0 or later
ARM Mali | Nexus 10, Samsung Note 3, Samsung Note PRO 12.2, Meizu MX3 | OpenCL 1.1 (T604~T678) | OCL module | Android 4.0 or later
nVIDIA Tegra | Google Project Tango, HTC Nexus 9, Microsoft Surface 2, Nvidia Shield, Tegra Note 7 | CUDA, OpenCL 1.2 (K1 only) | GPU module | Android 4.0 or later (K1 only)
PowerVR | iPad Air, iPad mini | OpenCL 1.2 | OCL module | none
Intel HD Graphics | Microsoft Surface Pro 3, Sony VAIO Tap 11 | OpenCL 1.1 | OCL module | none
Sources: AnandTech; "Nvidia CEO sees future in cars and gaming," CNet, 2014/5/19.
GPGPU
❖ Single-core → Multi-core → Many-core
❖PC
• nVidia Tesla + CUDA/OpenCV
❖Android
• Qualcomm Adreno + OpenCV ocl
• nVidia Tegra + OpenCV gpu
Parallel Programming
❖C/C++/OpenCV
• OpenMP, OpenACC, CUDA, C++ AMP
• OpenCL
❖Java
• OpenCL, RenderScript
❖Notice that OpenCL and RenderScript are
• Not efficient in parallelization
• Efficient in CV algorithmic design
OpenCV Acceleration (1/2)
❖Ver. 2.4.x
• gpu module: CUDA, PC
• ocl module: OpenCL, mobile
❖Ver. 3.0 (2014/6)
• Transparent API for GPGPU
acceleration
OpenCL 2.0
❖Released in 2013
❖SVM: Shared Virtual Memory
• OpenCL 1.2: Explicit memory
management
❖Dynamic (Nested) Parallelism
• Allows a device to enqueue kernels onto
itself – no round trip to host required
❖Disadvantages
• Requires strong hardware support
• Not well supported by current GPGPUs
CUDA still Dominant in the
Near Future
❖ However, we have to manually parallelize
the algorithm: more design overhead
❖ We need expertise in
• Algorithms of image and signal processing
• Filtering, frequency analysis, compression,
feature extraction, recognition, ...
• Theory, tools and methodology of parallel
computing
• Communication, synchronization, resource
management, load balancing, debugging, ...
GPUs for Multimedia
• 10X: Motion Estimation for H.264/AVC on Multiple GPUs Using NVIDIA CUDA
• 10X: CUDA JPEG Decoder (DivideFrame GPU Decoder)
• 10X: Hyperspectral Image Compression on NVIDIA GPUs
• 26X: GPU Decoder (Vegas/Premiere): Using the Power of NVIDIA Graphic Cards to Decode H.264 Video Files
• 3.5X: PowerDirector7 Ultra
GPUs for Computer Vision (1/2)
• 87X: CUDA SURF, a real-time implementation for SURF (TU Darmstadt)
• 26X: Leukocyte Tracking, ImageJ Plugin (University of Virginia)
• 200X: Real-time Spatiotemporal Stereo Matching Using the Dual-Cross-Bilateral Grid
• 100X: Image Denoising with Bilateral Filter (Wroclaw University of Technology)
• 85X: Digital Breast Tomosynthesis Reconstruction (Massachusetts General Hospital)
• 100X: Fast Optical Flow on GPU at Video Rate for Full HD Resolution (Onera)
• 8X: A Framework for Efficient and Scalable Execution of Domain-specific Templates on GPU (NEC Labs, Berkeley, Purdue)
• 13X: Accelerating Advanced MRI Reconstructions (University of Illinois)
GPUs for Computer Vision (2/2)
• 20X: GPU for Surveillance
• 13X: Fast Human Detection with Cascaded Ensembles
• 109X: Fast Sliding-Window Object Detection
• 263X: GPU Acceleration of Object Classification Algorithm Using NVIDIA CUDA
• 10X: Real-time Visual Tracker by Stream Processing
• 45X: A GPU Accelerated Evolutionary Computer Vision System
• 3X: Canny Edge Detection
• 300X: Audience Measurement, Real-time Video Analysis for Counting People, Face Detection and Tracking
Readings (1/2)
• Wang, Yuan-Kai, and Wen-Bin Huang. "Acceleration of an improved
Retinex algorithm." IEEE Conference on Computer Vision and
Pattern Recognition Workshops (CVPRW). 2011.
• Wang, Yuan-Kai, and Wen-Bin Huang. "A CUDA-enabled parallel
algorithm for accelerating retinex." Journal of Real-Time Image
Processing (2012): 1-19.
• Pauwels, Karl, et al. "A comparison of FPGA and GPU for real-time
phase-based optical flow, stereo, and local image features."
Computers, IEEE Transactions on 61.7 (2012): 999-1012.
• Pratx, Guillem, and Lei Xing. "GPU computing in medical physics: A
review." Medical physics 38.5 (2011): 2685-2697.
• Cope, Ben, et al. "Performance comparison of graphics processors
to reconfigurable logic: a case study." Computers, IEEE
Transactions on 59.4 (2010): 433-448.
Readings (2/2)
❖ “Designing Visionary Mobile Apps Using the Tegra
Android Development Pack,” http://bit.ly/1jvwbgV
❖ “Getting Started With GPU-Accelerated Computer
Vision Using OpenCV and CUDA,”
http://bit.ly/1oMwJEG
❖ “The open standard for parallel programming of
heterogeneous systems,”
https://www.khronos.org/opencl/
OpenCV Acceleration
❖ GPU Module Introduction — OpenCV 2.4.9.0
documentation
❖ OpenCL Module Introduction - OpenCV documentation
❖ OpenCV-CL: Computer vision with OpenCL
acceleration, AMD Developer Central, 2013.
❖ Pulli, Kari, et al. "Real-time computer vision with
OpenCV." Communications of the ACM 55.6 (2012):
61-69.
❖ Allusse, Yannick, et al. "GpuCV: A GPU-accelerated
framework for image processing and computer vision."
Advances in Visual Computing. Springer Berlin
Heidelberg, 2008. 430-439.
CUDA
❖ CUDA Programming guide. nVidia.
❖ CUDA Best Practices Guide. nVidia.
❖ CUDA Reference Manual. nVidia.
❖ CUDA Zone - NVIDIA Developer,
https://developer.nvidia.com/cuda-zone
❖ Parallel Programming and Computing Platform | CUDA
Home, www.nvidia.com/object/cuda_home_new.html
❖ Applications of CUDA for Imaging and Computer
Vision
http://www.nvidia.com/object/imaging_comp_vision.html
❖ nVidia Performance Primitives (NPP)
http://developer.nvidia.com/object/npp_home.html
OpenCL
❖ Khronos OpenCL specification, reference card, tutorials, etc:
http://www.khronos.org/opencl
❖ AMD OpenCL Resources:
http://developer.amd.com/opencl
❖ NVIDIA OpenCL Resources:
http://developer.nvidia.com/opencl
❖ Books
• Using OpenCL: Programming Massively Parallel Computers.
IOS Press, 2012.
• OpenCL programming guide. Pearson Education, 2011.
• Heterogeneous Computing with OpenCL: Revised OpenCL 1.2.
Newnes, 2012.
• OpenCL in Action: how to accelerate graphics and
computation. Manning, 2012.
RenderScript
❖ RenderScript for Android Developer, Official web site
http://developer.android.com/guide/topics/renderscript/compute.ht
ml
❖ Qian, Xi, Guangyu Zhu, and Xiao-Feng Li.
"Comparison and analysis of the three programming
models in google android." First Asia-Pacific
Programming Languages and Compilers Workshop.
2012.
❖ "High Performance Apps Development with
RenderScript," 12th Kandroid Conference, 2013.
Web Sites and Resources
❖Embedded Vision Alliance,
http://www.embedded-vision.com
❖GPUComputing.Net,
http://www.gpucomputing.net
❖HSA Foundation, www.hsafoundation.com
Parallel Computing with
GPGPU
❖Programming Massively Parallel
Processors – A Hands-on Approach
• D. B. Kirk, W. M. Hwu
• Morgan Kaufmann, 2010
• http://www.nvidia.com/object/promotion_david_kirk_book.html