Deep Convolutional Network evaluation on the Intel Xeon Phi (Gaurav Raina)
With a sharp decline in camera cost and size, along with superior computing power available at increasingly low prices, computer vision applications are becoming ever present in our daily lives. Research shows that Convolutional Neural Networks (ConvNets) can outperform all other methods for computer vision tasks (such as object detection) in terms of accuracy and versatility.
One of the problems with these Neural Networks, which mimic the brain, is that they can be very demanding on the processor, requiring millions of computational nodes to function. Hence, it is challenging for Neural Network algorithms to achieve real-time performance on general-purpose embedded platforms.
Parallelization and vectorization are very effective ways to ease this problem and make it possible to implement such ConvNets on energy-efficient embedded platforms. This thesis presents the evaluation of a novel ConvNet for road speed sign detection on a breakthrough 57-core Intel Xeon Phi processor with 512-bit vector support. This mapping demonstrates that the parallelism inherent in the ConvNet algorithm can be effectively exploited by the 512-bit vector ISA and by utilizing the many-core paradigm.
Detailed evaluation shows that the best mappings require data-reuse strategies that exploit reuse at the cache and register level. These implementations are boosted by the use of low-level vector intrinsics (C-style functions that map directly onto Intel assembly instructions).
Ultimately we demonstrate an approach that can be used to accelerate Neural Networks on highly parallel many-core processors, with execution speedups of more than 12x on single-core performance alone.
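The data-reuse strategies described above target the convolution's multiply-accumulate loops. As an illustrative sketch (a NumPy toy, not the thesis implementation), a naive 2D convolution makes the overlapping windows visible; it is exactly this overlap that rewards register- and cache-level reuse, and the innermost multiply-accumulate is what a 512-bit vector ISA processes 16 floats at a time:

```python
import numpy as np

def conv2d_naive(image, kernel):
    """Naive valid-mode 2D convolution: the multiply-accumulate loops
    that vectorization and data-reuse strategies aim to accelerate."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow), dtype=image.dtype)
    for y in range(oh):
        for x in range(ow):
            # each output pixel re-reads an overlapping kh x kw window,
            # which is why cache/register-level reuse pays off
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

img = np.arange(16, dtype=np.float32).reshape(4, 4)
k = np.ones((2, 2), dtype=np.float32)
print(conv2d_naive(img, k))
```

Adjacent output pixels share most of their input window, so keeping that window in registers (or L1 cache) avoids repeatedly fetching the same values from memory.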
Early Benchmarking Results for Neuromorphic Computing (Desmond Yuen)
An update on the Intel Neuromorphic Research Community’s growth and benchmark results, including the addition of new corporate members and numerous new benchmark results computed on Intel’s neuromorphic test chip, Loihi.
Trip down the GPU lane with Machine Learning (Renaldas Zioma)
What a Machine Learning professional should know about GPUs!
Brief outline of the deck:
* GPU architecture explained with simple images
* memory bandwidth cheat sheets for common hardware configurations,
* overview of GPU programming model
* an under-the-hood peek at the main building block of ML: matrix multiplication
* effect of mini-batch size on performance
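Two of the bullets above, matrix multiplication and mini-batch size, can be connected in a few lines (a NumPy illustration, not taken from the deck): a fully connected layer is a single matrix multiply, and the batch dimension determines how much work each multiply carries.

```python
import numpy as np

# A fully connected layer is one matrix multiplication:
# (batch, in_features) @ (in_features, out_features) -> (batch, out_features)
rng = np.random.default_rng(0)
W = rng.standard_normal((512, 256)).astype(np.float32)

def dense(batch_x, W):
    return batch_x @ W  # a single GEMM, regardless of batch size

# Larger mini-batches turn many small matrix-vector products into one
# big matrix-matrix product, improving arithmetic intensity on a GPU.
x1 = rng.standard_normal((1, 512)).astype(np.float32)    # batch of 1
x64 = rng.standard_normal((64, 512)).astype(np.float32)  # batch of 64
print(dense(x1, W).shape, dense(x64, W).shape)
```

With a batch of 1 the weights are read once per single output row; with a batch of 64 the same weight read is amortized over 64 rows, which is why mini-batch size affects GPU throughput.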
I originally gave this talk at the internal Machine Learning Workshop at Unity Seattle.
High-quality PDF slides: http://bit.ly/2iQxm7X (on Dropbox)
Computing Performance: On the Horizon (2021) (Brendan Gregg)
Talk by Brendan Gregg for USENIX LISA 2021. https://www.youtube.com/watch?v=5nN1wjA_S30 . "The future of computer performance involves clouds with hardware hypervisors and custom processors, servers running a new type of BPF software to allow high-speed applications and kernel customizations, observability of everything in production, new Linux kernel technologies, and more. This talk covers interesting developments in systems and computing performance, their challenges, and where things are headed."
How Netflix Tunes EC2 Instances for Performance (Brendan Gregg)
CMP325 talk for AWS re:Invent 2017, by Brendan Gregg. "At Netflix we make the best use of AWS EC2 instance types and features to create a high performance cloud, achieving near bare metal speed for our workloads. This session will summarize the configuration, tuning, and activities for delivering the fastest possible EC2 instances, and will help other EC2 users improve performance, reduce latency outliers, and make better use of EC2 features. We'll show how we choose EC2 instance types, how we choose between EC2 Xen modes (HVM, PV, and PVHVM), and the importance of EC2 features such as SR-IOV for bare-metal performance. SR-IOV is used by EC2 enhanced networking, and recently for the new i3 instance type for enhanced disk performance as well. We'll also cover kernel tuning and observability tools, from basic to advanced. Advanced performance analysis includes the use of Java and Node.js flame graphs, and the new EC2 Performance Monitoring Counter (PMC) feature released this year."
AI is Impacting HPC Everywhere (Rob Farber)
In this deck from the Perth HPC Conference, Rob Farber from TechEnablement presents "AI is Impacting HPC Everywhere."
"The convergence of AI and HPC has created a fertile venue that is ripe for imaginative researchers — versed in AI technology — to make a big impact in a variety of scientific fields. From new hardware to new computational approaches, the true impact of deep- and machine learning on HPC is, in a word, “everywhere”. Just as technology changes in the personal computer market brought about a revolution in the design and implementation of the systems and algorithms used in high performance computing (HPC), so are recent technology changes in machine learning bringing about an AI revolution in the HPC community. Expect new HPC analytic techniques including the use of GANs (Generative Adversarial Networks) in physics-based modeling and simulation, as well as reduced precision math libraries such as NLAFET and HiCMA to revolutionize many fields of research. Other benefits of the convergence of AI and HPC include the physical instantiation of data flow architectures in FPGAs and ASICs, plus the development of powerful data analytic services."
Learn more: http://www.techenablement.com/
and
http://hpcadvisorycouncil.com/events/2019/australia-conference/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Shoot4U: Using VMM Assists to Optimize TLB Operations on Preempted vCPUs (Jiannan Ouyang, PhD)
These slides were presented at the 12th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE’16).
Virtual Machine based approaches to workload consolidation, as seen in IaaS cloud as well as datacenter platforms, have long had to contend with performance degradation caused by synchronization primitives inside the guest environments. These primitives can be affected by virtual CPU preemptions by the host scheduler that can introduce delays that are orders of magnitude longer than those primitives were designed for. While a significant amount of work has focused on the behavior of spinlock primitives as a source of these performance issues, spinlocks do not represent the entirety of synchronization mechanisms that are susceptible to scheduling issues when running in a virtualized environment. In this paper we address the virtualized performance issues introduced by TLB shootdown operations. Our profiling study, based on the PARSEC benchmark suite, has shown that up to 64% of a VM's CPU time can be spent on TLB shootdown operations under certain workloads. In order to address this problem, we present a paravirtual TLB shootdown scheme named Shoot4U. Shoot4U completely eliminates TLB shootdown preemptions by invalidating guest TLB entries from the VMM and allowing guest TLB shootdown operations to complete without waiting for remote virtual CPUs to be scheduled. Our performance evaluation using the PARSEC benchmark suite demonstrates that Shoot4U can reduce benchmark runtime by up to 85% compared to an unmodified Linux kernel, and up to 44% over a state-of-the-art paravirtual TLB shootdown scheme.
Achieving Performance Isolation with Lightweight Co-Kernels (Jiannan Ouyang, PhD)
These slides were presented at the 24th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '15).
Performance isolation is emerging as a requirement for High Performance Computing (HPC) applications, particularly as HPC architectures turn to in situ data processing and application composition techniques to increase system throughput. These approaches require the co-location of disparate workloads on the same compute node, each with different resource and runtime requirements. In this paper we claim that these workloads cannot be effectively managed by a single Operating System/Runtime (OS/R). Therefore, we present Pisces, a system software architecture that enables the co-existence of multiple independent and fully isolated OS/Rs, or enclaves, that can be customized to address the disparate requirements of next generation HPC workloads. Each enclave consists of a specialized lightweight OS co-kernel and runtime, which is capable of independently managing partitions of dynamically assigned hardware resources. Contrary to other co-kernel approaches, in this work we consider performance isolation to be a primary requirement and present a novel co-kernel architecture to achieve this goal. We further present a set of design requirements necessary to ensure performance isolation, including: (1) elimination of cross OS dependencies, (2) internalized management of I/O, (3) limiting cross enclave communication to explicit shared memory channels, and (4) using virtualization techniques to provide missing OS features. The implementation of the Pisces co-kernel architecture is based on the Kitten Lightweight Kernel and Palacios Virtual Machine Monitor, two system software architectures designed specifically for HPC systems. Finally we will show that lightweight isolated co-kernels can provide better performance for HPC applications, and that isolated virtual machines are even capable of outperforming native environments in the presence of competing workloads.
Talk by Brendan Gregg for YOW! 2021. "The pursuit of faster performance in computing is the driving reason for many new technologies and updates. This talk discusses performance improvements now underway that you will likely be adopting soon, for processors (including 3D stacking and cloud vendor CPUs), memory (including DDR5 and high-bandwidth memory [HBM]), disks (including 3D Xpoint as a 3D NAND accelerator), networking (including QUIC and eXpress Data Path [XDP]), runtimes, hypervisors, and more. The future of performance is increasingly cloud-based, with hardware hypervisors and custom processors, meaningful observability of everything down to cycle stalls (even as cloud guests), and high-speed syscall-avoiding applications that use eBPF, FPGAs, and io_uring. The talk also discusses where future performance improvements might be expected, with predictions for new technologies."
For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/altera/embedded-vision-training/videos/pages/may-2015-embedded-vision-summit
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Deshanand Singh, Director of Software Engineering at Altera, presents the "Efficient Implementation of Convolutional Neural Networks using OpenCL on FPGAs" tutorial at the May 2015 Embedded Vision Summit.
Convolutional neural networks (CNNs) are becoming increasingly popular in embedded applications such as vision processing and automotive driver assistance systems. The structure of CNN systems is characterized by cascades of FIR filters and transcendental functions. FPGA technology offers a very efficient way of implementing these structures by allowing designers to build custom hardware datapaths that implement the CNN structure. One challenge of using FPGAs revolves around the design flow, which has traditionally been centered on tedious hardware description languages.
In this talk, Deshanand gives a detailed explanation of how CNN algorithms can be expressed in OpenCL and compiled directly to FPGA hardware. He gives details on code optimizations and provides comparisons with the efficiency of hand-coded implementations.
Kernel Recipes 2019 - XDP closer integration with network stack (Anne Nicolas)
XDP (eXpress Data Path) is the new programmable in-kernel fast-path, which is placed as a layer before the existing Linux kernel network stack (netstack).
We claim XDP is not kernel bypass, as it is a layer before the netstack and can easily fall through to it. In reality, it can easily be (ab)used to create a kernel-bypass situation, where none of the kernel facilities are used (in the form of BPF helpers and in-kernel tables). The main disadvantage of kernel bypass is the need to re-implement everything, even basic building blocks like routing tables and ARP protocol handling.
It is part of the concept, and of the speed gain, that XDP allows users to avoid calling parts of the kernel code. Users have the freedom to do kernel bypass and re-implement everything, but the kernel should provide access to more in-kernel tables via BPF helpers, so that users can leverage other parts of the Open Source ecosystem, like router daemons.
This talk is about how XDP can work in concert with the netstack, and proposes how we can take this even further. Crazy ideas, like using XDP frames to move SKB allocation out of driver code, will also be proposed.
dCUDA: Distributed GPU Computing with Hardware Overlap (inside-BigData.com)
Torsten Hoefler from ETH Zurich presented this deck at the Switzerland HPC Conference.
"Over the last decade, CUDA and the underlying GPU hardware architecture have continuously gained popularity in various high-performance computing application domains such as climate modeling, computational chemistry, or machine learning. Despite this popularity, we lack a single coherent programming model for GPU clusters. We therefore introduce the dCUDA programming model, which implements device-side remote memory access with target notification. To hide instruction pipeline latencies, CUDA programs over-decompose the problem and over-subscribe the device by running many more threads than there are hardware execution units. Whenever a thread stalls, the hardware scheduler immediately proceeds with the execution of another thread ready for execution. This latency-hiding technique is key to make best use of the available hardware resources. With dCUDA, we apply latency hiding at cluster scale to automatically overlap computation and communication. Our benchmarks demonstrate perfect overlap for memory bandwidth-bound tasks and good overlap for compute-bound tasks."
Watch the video presentation: http://wp.me/p3RLHQ-gCB
A Survey on GPU Systems Considering Their Performance on Different Applications (CSEIJ)
In this paper we study the NVIDIA graphics processing unit (GPU) along with its computational power and applications. Although these units are specially designed for graphics applications, we can employ their computational power for non-graphics applications too. The GPU offers high parallel processing power, low cost of computation, and short execution times, giving a good performance-per-energy ratio. This property of deploying GPUs for intensive computation of small sets of similar instructions has played a significant role in reducing CPU overhead. The GPU has several key advantages over the CPU architecture, as it provides high parallelism, intensive computation, and significantly higher throughput. It consists of thousands of hardware threads that execute programs in a SIMD fashion, so the GPU can be an alternative to the CPU in high-performance and supercomputing environments. The bottom line is that GPU-based general-purpose computing is a hot topic of research, and there is much to explore beyond graphics processing applications.
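The SIMD execution model the survey describes, one instruction applied to many data elements at once, can be sketched outside the GPU as well. A hedged NumPy illustration (not from the paper): a single vectorized expression replaces an explicit element-by-element loop, which is the same principle a GPU applies across thousands of hardware threads.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
b = np.array([10.0, 20.0, 30.0, 40.0], dtype=np.float32)

# scalar version: one multiply per iteration (how a plain CPU loop behaves)
scalar = [x * y for x, y in zip(a, b)]

# data-parallel version: a single operation over the whole array,
# analogous to a GPU issuing one instruction across many threads
vectorized = a * b
print(vectorized)
```

Both produce the same values; the difference is that the data-parallel form exposes all the element-wise work at once, which is what lets SIMD hardware execute it concurrently.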
Deep Convolutional Neural Network acceleration on the Intel Xeon Phi (Gaurav Raina)
With a sharp decline in camera cost and size along with superior computing power available at increasingly low prices, computer vision applications are becoming ever present in our daily lives. Research shows that Convolutional Neural Networks can outperform all other methods for computer vision tasks (such as object detection) in terms of accuracy and versatility.
One of the problems with these Neural Networks, which mimic the brain, is that they can be very demanding on the processor, requiring millions of computational nodes to function. Hence, it is challenging for Neural Network algorithms to achieve real-time performance on general purpose embedded platforms. Parallelization is one of the most effective ways to ease this problem and make it possible to implement such Neural Nets on energy efficient embedded platforms.
We present an evaluation of a novel Convolutional Neural Network for road speed sign detection on the new 57-core Xeon Phi processor with 512-bit vector support. This aims to demonstrate that the parallelism inherent in the algorithm can be effectively exploited by the 512-bit vector ISA and by utilizing the many-core paradigm.
Ultimately we demonstrate an approach that can be used to accelerate Neural Network based applications on massively parallel many-core processors, with speedups of more than 12x on single-core performance alone.
Accelerating Real Time Applications on Heterogeneous Platforms (IJMER)
In this paper we describe novel implementations of depth estimation from stereo images using feature extraction algorithms that run on the graphics processing unit (GPU), which is suitable for real-time applications like analyzing video in real-time vision systems. Modern graphics cards contain a large number of parallel processors and high-bandwidth memory for accelerating data computation operations. In this paper we give a general idea of how to accelerate real-time applications using heterogeneous platforms. We propose to use the added resources to apply more computationally involved optimization methods. This proposed approach will indirectly accelerate a database by producing better plan quality.
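Stereo depth estimation of the kind described above reduces, at its simplest, to finding the horizontal shift (disparity) that best aligns patches of the left and right images; nearby objects shift more than distant ones. A hedged toy sketch in NumPy (block matching by sum of absolute differences, not the paper's GPU implementation), where every pixel's search is independent and therefore maps naturally onto GPU threads:

```python
import numpy as np

def disparity_sad(left, right, max_disp, win=1):
    """Toy stereo matcher: for each pixel of the left image, find the
    horizontal shift into the right image that minimizes the sum of
    absolute differences (SAD) over a (2*win+1)-wide window."""
    h, w = left.shape
    disp = np.zeros((h, w), dtype=np.int32)
    for y in range(h):
        for x in range(win, w - win):
            best, best_d = np.inf, 0
            # only consider shifts that keep the window inside the image
            for d in range(min(max_disp, x - win) + 1):
                lw = left[y, x - win:x + win + 1].astype(np.float32)
                rw = right[y, x - d - win:x - d + win + 1]
                sad = np.abs(lw - rw).sum()
                if sad < best:
                    best, best_d = sad, d
            disp[y, x] = best_d
    return disp

# synthetic test: the right image is the left shifted by 2 pixels,
# so interior pixels should recover a disparity of 2
left = np.array([[0, 1, 2, 3, 4, 5, 6, 7]], dtype=np.float32)
right = np.roll(left, -2, axis=1)
print(disparity_sad(left, right, max_disp=3)[0])
```

Each (pixel, candidate-shift) pair is an independent SAD computation, which is exactly the kind of massively data-parallel workload that benefits from the many parallel processors and high memory bandwidth of a graphics card.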
How Netflix Tunes EC2 Instances for PerformanceBrendan Gregg
CMP325 talk for AWS re:Invent 2017, by Brendan Gregg. "
At Netflix we make the best use of AWS EC2 instance types and features to create a high performance cloud, achieving near bare metal speed for our workloads. This session will summarize the configuration, tuning, and activities for delivering the fastest possible EC2 instances, and will help other EC2 users improve performance, reduce latency outliers, and make better use of EC2 features. We'll show how we choose EC2 instance types, how we choose between EC2 Xen modes: HVM, PV, and PVHVM, and the importance of EC2 features such SR-IOV for bare-metal performance. SR-IOV is used by EC2 enhanced networking, and recently for the new i3 instance type for enhanced disk performance as well. We'll also cover kernel tuning and observability tools, from basic to advanced. Advanced performance analysis includes the use of Java and Node.js flame graphs, and the new EC2 Performance Monitoring Counter (PMC) feature released this year."
In this deck from the Perth HPC Conference, Rob Farber from TechEnablement presents: AI is Impacting HPC Everywhere.
"The convergence of AI and HPC has created a fertile venue that is ripe for imaginative researchers — versed in AI technology — to make a big impact in a variety of scientific fields. From new hardware to new computational approaches, the true impact of deep- and machine learning on HPC is, in a word, “everywhere”. Just as technology changes in the personal computer market brought about a revolution in the design and implementation of the systems and algorithms used in high performance computing (HPC), so are recent technology changes in machine learning bringing about an AI revolution in the HPC community. Expect new HPC analytic techniques including the use of GANs (Generative Adversarial Networks) in physics-based modeling and simulation, as well as reduced precision math libraries such as NLAFET and HiCMA to revolutionize many fields of research. Other benefits of the convergence of AI and HPC include the physical instantiation of data flow architectures in FPGAs and ASICs, plus the development of powerful data analytic services."
Learn more: http://www.techenablement.com/
and
http://hpcadvisorycouncil.com/events/2019/australia-conference/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Shoot4U: Using VMM Assists to Optimize TLB Operations on Preempted vCPUsJiannan Ouyang, PhD
This slides were presented at the 12th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE’16).
Virtual Machine based approaches to workload consolidation, as seen in IaaS cloud as well as datacenter platforms, have long had to contend with performance degradation caused by synchronization primitives inside the guest environments. These primitives can be affected by virtual CPU preemptions by the host scheduler that can introduce delays that are orders of magnitude longer than those primitives were designed for. While a significant amount of work has focused on the behavior of spinlock primitives as a source of these performance issues, spinlocks do not represent the entirety of synchronization mechanisms that are susceptible to scheduling issues when running in a virtualized environment. In this paper we address the virtualized performance issues introduced by TLB shootdown operations. Our profiling study, based on the PARSEC benchmark suite, has shown that up to 64% of a VM's CPU time can be spent on TLB shootdown operations under certain workloads. In order to address this problem, we present a paravirtual TLB shootdown scheme named Shoot4U. Shoot4U completely eliminates TLB shootdown preemptions by invalidating guest TLB entries from the VMM and allowing guest TLB shootdown operations to complete without waiting for remote virtual CPUs to be scheduled. Our performance evaluation using the PARSEC benchmark suite demonstrates that Shoot4U can reduce benchmark runtime by up to 85% compared an unmodified Linux kernel, and up to 44% over a state-of-the-art paravirtual TLB shootdown scheme.
Achieving Performance Isolation with Lightweight Co-KernelsJiannan Ouyang, PhD
This slides were presented at the 24th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '15)
Performance isolation is emerging as a requirement for High Performance Computing (HPC) applications, particularly as HPC architectures turn to in situ data processing and application composition techniques to increase system throughput. These approaches require the co-location of disparate workloads on the same compute node, each with different resource and runtime requirements. In this paper we claim that these workloads cannot be effectively managed by a single Operating System/Runtime (OS/R). Therefore, we present Pisces, a system software architecture that enables the co-existence of multiple independent and fully isolated OS/Rs, or enclaves, that can be customized to address the disparate requirements of next generation HPC workloads. Each enclave consists of a specialized lightweight OS co-kernel and runtime, which is capable of independently managing partitions of dynamically assigned hardware resources. Contrary to other co-kernel approaches, in this work we consider performance isolation to be a primary requirement and present a novel co-kernel architecture to achieve this goal. We further present a set of design requirements necessary to ensure performance isolation, including: (1) elimination of cross OS dependencies, (2) internalized management of I/O, (3) limiting cross enclave communication to explicit shared memory channels, and (4) using virtualization techniques to provide missing OS features. The implementation of the Pisces co-kernel architecture is based on the Kitten Lightweight Kernel and Palacios Virtual Machine Monitor, two system software architectures designed specifically for HPC systems. Finally we will show that lightweight isolated co-kernels can provide better performance for HPC applications, and that isolated virtual machines are even capable of outperforming native environments in the presence of competing workloads.
Talk by Brendan Gregg for YOW! 2021. "The pursuit of faster performance in computing is the driving reason for many new technologies and updates. This talk discusses performance improvements now underway that you will likely be adopting soon, for processors (including 3D stacking and cloud vendor CPUs), memory (including DDR5 and high-bandwidth memory [HBM]), disks (including 3D Xpoint as a 3D NAND accelerator), networking (including QUIC and eXpress Data Path [XDP]), runtimes, hypervisors, and more. The future of performance is increasingly cloud-based, with hardware hypervisors and custom processors, meaningful observability of everything down to cycle stalls (even as cloud guests), and high-speed syscall-avoiding applications that use eBPF, FPGAs, and io_uring. The talk also discusses where future performance improvements might be expected, with predictions for new technologies."
For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/altera/embedded-vision-training/videos/pages/may-2015-embedded-vision-summit
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Deshanand Singh, Director of Software Engineering at Altera, presents the "Efficient Implementation of Convolutional Neural Networks using OpenCL on FPGAs" tutorial at the May 2015 Embedded Vision Summit.
Convolutional neural networks (CNN) are becoming increasingly popular in embedded applications such as vision processing and automotive driver assistance systems. The structure of CNN systems is characterized by cascades of FIR filters and transcendental functions. FPGA technology offers a very efficient way of implementing these structures by allowing designers to build custom hardware datapaths that implement the CNN structure. One challenge of using FPGAs revolves around the design flow that has been traditionally centered around tedious hardware description languages.
In this talk, Deshanand gives a detailed explanation of how CNN algorithms can be expressed in OpenCL and compiled directly to FPGA hardware. He gives detail on code optimizations and provides comparisons with the efficiency of hand-coded implementations.
Kernel Recipes 2019 - XDP closer integration with network stackAnne Nicolas
XDP (eXpress Data Path) is the new programmable in-kernel fast-path, which is placed as a layer before the existing Linux kernel network stack (netstack).
We claim XDP is not kernel-bypass, as it is a layer before and it can easily fall-through to netstack. Reality is that it can easily be (ab)used to create a kernel-bypass situation, where non of the kernel facilities are used (in form of BPF-helpers and in-kernel tables). The main disadvantage with kernel-bypass, is the need to re-implement everything, even basic building blocks, like routing tables and ARP protocol handling.
It is part of the concept and speed gain, that XDP allows users to avoid calling part of the kernel code. Users have the freedom to do kernel-bypass and re-implement everything, but the kernel should provide access to more in-kernel tables, via BPF-helpers, such that users can leverage other parts of the Open Source ecosystem, like router daemons etc.
This talk is about how XDP can work in-concert with netstack, and proposal on how we can take this even-further. Crazy ideas like using XDP frames to move SKB allocation out of driver code, will also be proposed.
dCUDA: Distributed GPU Computing with Hardware Overlapinside-BigData.com
Torsten Hoefler from ETH Zurich presented this deck at the Switzerland HPC Conference.
"Over the last decade, CUDA and the underlying GPU hardware architecture have continuously gained popularity in various high-performance computing application domains such as climate modeling, computational chemistry, or machine learning. Despite this popularity, we lack a single coherent programming model for GPU clusters. We therefore introduce the dCUDA programming model, which implements device-side remote memory access with target notification. To hide instruction pipeline latencies, CUDA programs over-decompose the problem and over-subscribe the device by running many more threads than there are hardware execution units. Whenever a thread stalls, the hardware scheduler immediately proceeds with the execution of another thread ready for execution. This latency-hiding technique is key to make best use of the available hardware resources. With dCUDA, we apply latency hiding at cluster scale to automatically overlap computation and communication. Our benchmarks demonstrate perfect overlap for memory bandwidth-bound tasks and good overlap for compute-bound tasks."
Watch the video presentation: http://wp.me/p3RLHQ-gCB
A SURVEY ON GPU SYSTEM CONSIDERING ITS PERFORMANCE ON DIFFERENT APPLICATIONScseij
In this paper we study NVIDIA graphics processing unit (GPU) along with its computational power and applications. Although these units are specially designed for graphics application we can employee there computation power for non graphics application too. GPU has high parallel processing power, low cost of computation and less time utilization; it gives good result of performance per energy ratio. This GPU deployment property for excessive computation of similar small set of instruction played a significant role in reducing CPU overhead. GPU has several key advantages over CPU architecture as it provides high parallelism, intensive computation and significantly higher throughput. It consists of thousands of hardware threads that execute programs in a SIMD fashion hence GPU can be an alternate to CPU in high performance environment and in supercomputing environment. The base line is GPU based general purpose computing is a hot topics of research and there is great to explore rather than only graphics processing application.
Deep Convolutional Neural Network acceleration on the Intel Xeon Phi (Gaurav Raina)
With a sharp decline in camera cost and size along with superior computing power available at increasingly low prices, computer vision applications are becoming ever present in our daily lives. Research shows that Convolutional Neural Networks can outperform all other methods for computer vision tasks (such as object detection) in terms of accuracy and versatility.
One of the problems with these Neural Networks, which mimic the brain, is that they can be very demanding on the processor, requiring millions of computational nodes to function. Hence, it is challenging for Neural Network algorithms to achieve real-time performance on general purpose embedded platforms. Parallelization is one of the most effective ways to ease this problem and make it possible to implement such Neural Nets on energy efficient embedded platforms.
We present an evaluation of a novel Convolutional Neural Network for road speed sign detection on the new 57-core Intel Xeon Phi processor with 512-bit vector support. This aims to demonstrate that the parallelism inherent in the algorithm can be effectively exploited by the 512-bit vector ISA and by utilizing the many-core paradigm.
Ultimately we demonstrate an approach which can be used to accelerate Neural Network based applications on massively parallel many-core processors, with speedups of more than 12x on single core performance alone.
Accelerating Real Time Applications on Heterogeneous Platforms (IJMER)
In this paper we describe novel implementations of depth estimation from stereo images using feature extraction algorithms that run on the graphics processing unit (GPU), suitable for real-time applications such as analyzing video in real-time vision systems. Modern graphics cards contain a large number of parallel processors and high-bandwidth memory for accelerating data computation operations. We give a general idea of how to accelerate real-time applications using heterogeneous platforms, and propose using the added resources to employ more computationally involved optimization methods. This approach will indirectly accelerate a database by producing better plan quality.
Design and Implementation of Quintuple Processor Architecture Using FPGA (IJERA Editor)
The advanced quintuple processor core is a design philosophy that has become mainstream in scientific and engineering applications. The increasing performance and gate capacity of recent FPGA devices permit complex logic systems to be implemented on a single programmable device. Embedded multiprocessors face a new problem with thread synchronization, caused by the distributed memory: when thread synchronization is violated, the processors can access the same value at the same time. Processor performance can be increased by adopting clock scaling techniques and microarchitectural enhancements. We therefore designed a new architecture called Advanced Concurrent Computing, implemented on an FPGA chip using VHDL. The Advanced Concurrent Computing architecture makes simultaneous use of both parallel and distributed computing. The full architecture of the quintuple processor core is designed to perform arithmetic, logical, shifting, and bit manipulation operations. The proposed advanced quintuple processor core contains homogeneous RISC processors augmented with pipelined processing units, a multi-bus organization, and I/O ports, along with the other functional elements required to implement embedded SoC solutions. Performance issues of the designed core, such as area, speed, power dissipation, and propagation delay, are analyzed at the 90nm process technology node using Xilinx tools.
Stay up-to-date on the latest news, events and resources for the OpenACC community. This month’s highlights cover pseudo random number generation, the first-ever MONAI Bootcamp, upcoming GPU Hackathons and Bootcamps, and new resources!
Architecture exploration of recent GPUs to analyze the efficiency of hardware... (journalBEEI)
This study analyzes the efficiency of parallel computational applications with the adoption of recent graphics processing units (GPUs). We investigate the impact of the additional resources of the recent architecture on popular benchmarks, compared with the previous architecture. Our simulation results demonstrate that the Pascal GPU architecture improves performance by 273% on average compared to the older Fermi architecture. To evaluate the performance improvement attributable to specific hardware resources, we divide the hardware resources into two types: computing and memory resources. Computing resources have a bigger impact on performance improvement than memory resources in most benchmarks. For Hotspot and B+ tree, an architecture adopting only enhanced computing resources achieves performance gains similar to an architecture adopting both computing and memory resources. We also evaluate the influence of the number of warp schedulers per SM (Streaming Multiprocessor) on GPU performance, in relation to barrier waiting time. Based on these analyses, we propose a development direction for future generations of GPUs.
Stay up-to-date on the latest news, events and resources for the OpenACC community. This month’s highlights cover the first remote GPU Hackathons, a complete schedule of upcoming events, using OpenACC for a biophysics problem, NVIDIA HPC SDK, GCC 10, new resources and more!
Professional Project - C++ OpenCL - Platform agnostic hardware acceleration for deep neural networks
1. University of Surrey
Faculty of engineering and physical sciences
Department of Computing
Final Year Project Report
19/05/2016
Title: Platform agnostic hardware acceleration
for deep neural networks
Student: Callum McMahon
URN: 6279333
Supervisor: Lillian Tang
2. Platform agnostic hardware acceleration for deep neural networks P a g e | 1
Contents
Abstract....................................................................................................................................... 3
Abbreviations .............................................................................................................................. 3
Introduction ................................................................................................................................. 4
Background ............................................................................................................................. 4
Objectives................................................................................................................................ 5
Literature Review ........................................................................................................................ 5
Pre-existing software packages............................................................................................... 5
Exploring Caffe’s OpenCL branch in more depth..................................................................... 5
Theoretical groundwork ........................................................................................................... 7
Multi Layer feed forward perceptron ..................................................................... 7
Modern Activation Functions and the Back Propagation algorithm....................................... 8
Weight regularization ........................................................................................................... 9
OpenCL learning resources and reference material............................................................... 11
System Design.......................................................................................................................... 11
Development environment..................................................................................................... 11
Essential Requirements......................................................................................................... 12
Implementation Deliverables.................................................................................................. 12
Technical Challenges ............................................................................................................ 12
Feeding the OpenCL device............................................................................................... 12
OpenCL kernel efficiency considerations ........................................................................... 14
Using clFFT ....................................................................................................................... 15
Implementation Schedule ...................................................................................................... 15
Design specification............................................................................................................... 15
Designing a flexible network architecture ........................................................................... 15
Validation tests................................................................................................................... 16
Class hierarchy .................................................................................................................. 17
Results...................................................................................................................................... 18
Requirement satisfaction ....................................................................................................... 18
Refer to system design, essential and optional requirements, page 11.................................. 18
Test validation Results........................................................................................................... 19
MNIST classification examples.............................................................................................. 20
Result Discussion.................................................................................................................. 20
Evaluation ................................................................................................................................. 21
Further Work ......................................................................................................................... 21
Conclusion............................................................................................................................. 21
Deployment guide ..................................................................................................................... 22
Bibliography .............................................................................................................................. 23
Appendices ............................................................................................................................... 25
A - Network validation architectures....................................................................................... 25
A.1. MNIST........................................................................................................................... 25
A.2. sin(a).............................................................................................................................. 25
A.3. sort(a, b, c, d, e) ............................................................................................................ 26
A.4. polynomial...................................................................................................................... 26
A.5. MNIST........................................................................................................................... 27
B – clFFT library experiment................................................................................... 27
B.1. Fourier transform and inverse Fourier transform via clFFT and OpenCL ........................ 27
B.2. Program outputs from B.1. Showing only the first column for succinctness. ................... 31
C – Gantt time plans.............................................................................................................. 32
Abstract
This report provides an overview of resources available for deep neural network machine
learning. Current state of the art software libraries employ massively vectorised training
pipelines, enabling highly parallel computation and hence faster training convergence. Graphics
processing units provide access to a greater threading capability than a typical central
processing unit. As such, a number of libraries have been developed with alternative fast native
GPU code paths. Current implementations are tightly integrated with the CUDA platform, a
proprietary programming model restricted to Nvidia GPUs.
In response, a basic cross-platform neural network library has been developed in C++,
demonstrating the feasibility of a single high performance platform agnostic code path. The
library has been built on top of the OpenCL programming framework. OpenCL is maintained by
a non-profit consortium group, Khronos, with implementations available on a number of devices
from different vendors.
Validation tests were performed on multilayer neural networks to assess training performance
and final network accuracy. Training consisted of multiple passes using back propagation and an
adaptive global learning rate.
A network consisting of two hidden linear rectifier layers was trained on the MNIST dataset, a well known set of labelled greyscale digit images. The best observed error was achieved with a total of 1,099,770 trainable parameters over 200 epochs, attaining a classification error of 4.5%. Each epoch consisted of 5000 stochastic samples and back propagation passes. Total training time was 53 minutes. Fast convergence was also observed using fewer training epochs: using 10 epochs, a classification error rate of 9.6% was observed, taking 164.6 seconds of training on an AMD Fury X.
Training on the Fury X was found to be approximately 5x faster than on the i7-6700k. The Fury X boasts approximately 72x the single-precision floating point performance of the i7-6700k, suggesting further optimisations can be made.
For demonstration purposes, Windows x64 has been explicitly targeted by this release; porting to another operating system would be trivial. The library has been written against OpenCL version 2.0 in order to take advantage of fine control over job queues. All recent CPUs and GPUs from AMD and Intel are OpenCL 2.0 capable. Currently Nvidia devices only support OpenCL 1.2, but 2.0 support is likely to come in the near future.
Abbreviations
CPU     Central Processing Unit
GPU     Graphics Processing Unit
CUDA    Compute Unified Device Architecture
OpenCL  Open Computing Language
clBLAS  OpenCL Basic Linear Algebra Subprograms
clFFT   OpenCL Fast Fourier Transform
ReLU    Rectified Linear Unit
LU      Linear Unit
SiU     Sigmoid Unit
Introduction
Background
The field of machine learning is currently experiencing renewed interest. Developments in deep neural network architectures and training methods have resulted in greatly improved model accuracy on difficult tasks. Refinements to techniques are being continually developed, with error rates as low as 15.2% being reported in difficult tasks such as speech recognition [1]. Companies are investing large sums into neural network research; see Facebook open-sourcing deep learning modules for Torch [2]. There have been a number of high-profile public successes, such as Alphabet’s AlphaGo, the first program ever to beat a professional Go player without a handicap [3].
Figure 1.1 Google trend data showing the popularity of search terms.
Note the rapid rise of "deep learning" searches.
Deep neural networks are an evolution of single hidden layer neural networks. Whilst the idea of a distributed computational network was conceived in the late fifties, inspired by biological models, it was not until the invention of back propagation in 1970 [4] that an effective network training method became available. 1985 saw the first proposal of introducing convolution layers [5]. Since then a large number of new methods have been introduced: weight decay [6], fast convolution layers using Fourier transforms [7], dropout [8], and long short-term memory networks [9].
Demand for increased computational performance has risen with the increasing complexity of
neural networks. In 1995 it was demonstrated that GPUs could be used to effectively train neural
networks [10]. Neural network optimisation is a massively parallel problem, and as such is well
suited to GPU architectures, which give access to a much larger number of threads than a
typical CPU.
GPU APIs were originally designed around a fixed pipeline for producing visual effects. Traditionally it has been very difficult to exploit GPU parallelism for general algorithm computation. However, graphics API pipelines have become increasingly generic in order to handle more intricate computer graphics methods [11][12]. Hardware vendors have subsequently released more generic compute platforms [13][14][15][16] that can run code against GPU hardware, designed for the needs of the scientific computing community. Nvidia's CUDA 1.0 was released in 2007, and OpenCL 1.0 in 2009. CUDA program kernels are written in a dialect of C++, while OpenCL 2.0 kernels are based on the C99 specification.
CUDA is currently the more mature of the two GPU compute platforms, boasting a wider selection of libraries [17]. This has directly translated into more widespread CUDA hardware acceleration for training deep neural networks. In contrast, OpenCL implementations are generally incomplete or non-existent (table 2.1). However, CUDA is a proprietary platform that will only run on Nvidia's GPU hardware [18]. OpenCL implementations exist across a range of hardware from different vendors, including both CPUs and GPUs [19]. OpenCL therefore has the potential to provide a single unified fast code path for training deep neural networks.
Objectives
1. Develop a basic deep learning library that utilises OpenCL for all intensive operations.
2. Develop an easy to use interface within C++.
3. Maintain compatibility across as many OpenCL platforms as possible.
4. Minimise external dependencies to ease setup and increase portability.
Literature Review
Pre-existing software packages
Software     Primary language interface   Other language interfaces   CUDA GPU support   OpenCL CPU / GPU support
Caffe        Python                       C++, Matlab                 Yes                Third-party branch from AMD, but only neared feature completion as of late August 2015.
Neon         Python                       -                           Yes                No.
Theano       Python                       -                           Yes                In development.
Tensorflow   Python                       C++ (graphs only)           Yes                In development.
Torch        Lua                          C                           Yes                Third-party branch in development.
Figure 2.1.1 An overview of popular deep learning software environments.
None of the popular deep learning libraries provide official OpenCL support. Caffe is the only library with a feature complete OpenCL branch.
Exploring Caffe’s OpenCL branch in more depth
There are a large number of dependencies [20] required for installation. Installation is restricted to Ubuntu 12.04 or later, and only AMD GPUs are currently supported. Building and deploying the full Caffe OpenCL stack was deemed outside the scope of this project. Test performance metrics are available on the GitHub page [21]; see Fig 2.2.1.
Platform Speed (images per second)
AMD W9100 & A10-7850k 255
AMD R9 Fury & A10-7850k 261
AMD R290X @1000MHz & A10-7850k 268
AMD S9150 @900MHz & Xeon E5-2640 227
Figure 2.2.1. Training performance using the well known AlexNet network. [22]
The network inputs used by AlexNet are images of 256x256 resolution. Multiplying the total number of pixels by the number of images processed per second, we can see that Caffe's OpenCL branch is capable of training on approximately 17,104,896 input values per second on an AMD R9 Fury.
Platform Speed (images per second)
AMD W9100 & A10-7850k 590
AMD R9 Fury & A10-7850k 699
AMD R290X @1000MHz & A10-7850k 606
AMD S9150 @900MHz & Xeon E5-2640 452
Figure 2.2.2. Recognition performance using AlexNet. [22]
Similarly, we can see that recognition processes approximately 45,809,664 input values per second.
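The arithmetic behind both throughput figures can be made explicit. A small sketch (the resolution and image rates are taken from Figures 2.2.1 and 2.2.2; the function name is illustrative, not part of any library):

```cpp
#include <cassert>
#include <cstdint>

// Input values processed per second: every pixel of every image.
std::int64_t inputs_per_second(std::int64_t width, std::int64_t height,
                               std::int64_t images_per_second) {
    return width * height * images_per_second;
}

// AlexNet inputs are 256x256 images; rates are for the AMD R9 Fury:
//   training:    inputs_per_second(256, 256, 261) == 17104896
//   recognition: inputs_per_second(256, 256, 699) == 45809664
```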
Theoretical groundwork
Multi Layer feed forward perceptron
The perceptron network was first proposed in 1958 by Frank Rosenblatt [24]. Perceptrons are connected into a directed graph: the perceptrons at the start of the graph correspond to the network's inputs, and perceptrons at the end of the graph to its outputs. Input values are passed into the input perceptrons. Each subsequent perceptron computes a weighted sum of the outputs from prior connected perceptrons. The summed value is then passed through an activation function, A(x), and passed on to the next set of perceptrons. This process continues until the network output is reached. Early networks were handcrafted by tweaking connection weight values; modern neural networks employ learning algorithms to automatically update weight values.

    A(x) = d(max{x, 0}) / dx

Figure 2.3.2. The Heaviside step function, defined above as the derivative of max{x, 0}, was the activation function originally used by Rosenblatt. It has since been replaced by differentiable functions. Differentiable activation functions allow gradient descent to be used to modify connection weights in such a way that the network can be taught to output a set of desired values for a given input.

Figure 2.3.1. A diagram showing how a single perceptron unit processes inputs within a network. This process is called a forward pass.
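The forward pass described above can be sketched in a few lines of plain C++. The real library performs these sums as OpenCL jobs; this scalar version, using max(x, 0) as an illustrative activation function, only shows the structure of the computation:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Activation function A(x); here max(x, 0) purely for illustration.
double activation(double x) { return std::max(x, 0.0); }

// Forward pass for one unit j: weighted sum of prior outputs, then A(x).
// weights[i] is w_ij, the weight from incoming unit i to this unit.
double forward_unit(const std::vector<double>& prior_outputs,
                    const std::vector<double>& weights) {
    double sum = 0.0;
    for (std::size_t i = 0; i < weights.size(); ++i)
        sum += prior_outputs[i] * weights[i];
    return activation(sum);
}

// Forward pass for a whole layer: one weight vector per unit.
std::vector<double> forward_layer(const std::vector<double>& prior_outputs,
                                  const std::vector<std::vector<double>>& layer_weights) {
    std::vector<double> out;
    for (const auto& w : layer_weights)
        out.push_back(forward_unit(prior_outputs, w));
    return out;
}
```

Repeating forward_layer over successive layers, feeding each layer's output into the next, yields the complete forward pass.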
Modern Activation Functions and the Back Propagation algorithm
Back propagation [4] is widely used as a training algorithm for neural networks; it is a class of gradient descent algorithm. It works by first performing a forward pass of the network. See [25] for an overview of the algorithm.

    p_j = A( Σ_{i=1}^{incoming weights} p_i · w_ij )

Where A() is an activation function, w_ij is a weight between units i and j, and p_x is the output of unit x. Here i is the index of the unit closest to the input layer.
The activation function must be differentiable so that an error gradient may be calculated. The sigmoid function is commonly used. The linear rectifier activation function has been shown to have better characteristics under some conditions [26]. The linear rectifier prevents the vanishing gradient problem experienced by the sigmoid activation function, where inputs of large magnitude have activation gradients of 0, or near 0, which in turn reduces the weight update deltas to 0, or near 0.
Sigmoid and its derivative:

    A(x) = 1 / (1 + e^(-x))

    dA(x)/dx = e^(-x) / (1 + e^(-x))² = A(x)(1 − A(x))

Linear rectifier and its derivative:

    A(x) = ln(1 + e^x)

    dA(x)/dx = 1 / (1 + e^(-x))
An error delta is calculated at each output unit by finding the difference between its output and a desired output value. The error deltas are propagated back through the network to the input layer, storing deltas at each unit. This is referred to as a backwards pass.

Delta error for output units:

    δ_j = (p_j − t_j) · dA(p_j)/dx

Where t_j denotes the j-th output unit's target value.

Delta error for inner units:

    δ_j = dA(p_j)/dx · Σ_{i=1}^{outgoing weights} δ_i · w_ij

Here w_ij is the weight from unit i in the previously visited layer to unit j in the current layer, i.e. i is the index of the unit closest to the output layer.

Finally, weights are moved by a value proportional to the error delta at the unit they provide inputs for. The direction of change is opposite to the sign of the delta. The deltas are proportional to the rate of change of the network's error with respect to the incoming weights.

    Δw_ij = −a · δ_j · p_i, where δ_j · p_i = dError/dw_ij

Where a is the learning rate, and i is again the index of the unit closest to the input layer.
The learning rate, a, must be small enough to allow the network to converge, yet large enough to give a reasonable training time. Small a values may also cause the network to get stuck in local error minima.
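For a single unit, the update rules above reduce to a few expressions. A minimal sketch using the sigmoid activation (function names are illustrative, not the library's API):

```cpp
#include <cassert>
#include <cmath>

// Sigmoid activation A(x) = 1 / (1 + e^(-x)) and its derivative A(x)(1 - A(x)).
double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }
double sigmoid_deriv(double x) { double a = sigmoid(x); return a * (1.0 - a); }

// Delta for an output unit: (p_j - t_j) * dA/dx, evaluated at the unit's summed input x_j.
double output_delta(double p_j, double t_j, double x_j) {
    return (p_j - t_j) * sigmoid_deriv(x_j);
}

// Weight update: w_ij moves against the error gradient, scaled by the learning rate a.
double weight_update(double a, double delta_j, double p_i) {
    return -a * delta_j * p_i;
}
```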
Weight regularization
Weight regularization is commonly applied in one of two forms: weight decay [6] or dropout [8]. Weight regularization is intended to prevent overfitting, whereby the network learns to exactly reproduce the training outputs rather than learning a generalized pattern. Overfitted networks perform poorly on validation test sets.

Weight decay modification to the weight update rule:

    Δw_ij = −a · δ_j · p_i − d · sign(δ_j · p_i)

Where d is a small decay factor, such that d ≪ a.

Weight decay may however reduce final network performance, as it creates moving global optima. It is preferable to use dropout where possible. The dropout modification is applied to the forward pass during training, giving each unit a small probability of outputting a value of 0.
    p_j = A( Σ_{i=1}^{incoming weights} p_i · w_ij )   if rnd(0.0, 1.0) ≥ d
    p_j = 0                                            if rnd(0.0, 1.0) < d

Where d is a small dropout probability such that 0.0 ≤ d < 1.0.
Dropout attempts to spread learned patterns across the network, rather than isolated groups of
units.
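Applied to a whole layer, the dropout rule amounts to one random comparison per unit. A hedged sketch (the RNG choice and function name are assumptions, not the library's implementation):

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Apply dropout to a layer's activations during training: each unit
// outputs 0 with probability d, otherwise passes its value through.
std::vector<double> apply_dropout(const std::vector<double>& activations,
                                  double d, std::mt19937& rng) {
    std::uniform_real_distribution<double> rnd(0.0, 1.0);  // samples in [0, 1)
    std::vector<double> out(activations.size());
    for (std::size_t j = 0; j < activations.size(); ++j)
        out[j] = (rnd(rng) < d) ? 0.0 : activations[j];
    return out;
}
```

During inference dropout is disabled, so the full set of units contributes to every output.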
Convolution Layers and Fast Convolutions
Convolution layers provide a method of introducing translation-resistant weights into the network [27]. Units within a convolution layer share weights in a spatial pattern, allowing the network to quickly generalize for inputs containing translated patterns. Stacked convolution layers can identify extremely complex patterns much more rapidly than a typical multi layer network; convolution networks have seen great success in many applications.
Figure 2.3.3 A diagram showing how the weights are shared across convolutional layer units.
Convolution operations can however be expensive for large kernels, being O(nk²), where n is the number of units in the convolutional layer and k is the kernel width. It has been recognised that the convolution theorem can be applied to give a greatly reduced computation time of O(n log n) for the forward pass [28].

    F(c ⊛ k) = F(c) · F(k)
    ∴ c ⊛ k = F⁻¹(F(c) · F(k))

The convolution theorem states that the Fourier transform of the convolution of two matrices is equal to the elementwise product of their Fourier transforms. Using the fast Fourier transform algorithm, F(c) and F(k) can be computed in O(n log n), where n is the number of elements in c or k (they must have the same number of elements). Similarly, the back propagation algorithm may also be modified to take advantage of this identity [28].
Delta errors for a convolutional output layer:

    δ_j = (p − t) ⊙ dA(p)/dp

Note that dA(p)/dp is the matrix of activation function derivatives for the output layer, multiplied elementwise (⊙) with the matrix of output errors.

Delta errors for a convolutional inner layer:

    δ_j = dA(l_j)/dp ⊙ (δ_i ∗ w_ij^T)

Where i and j are now indexes between network layers, rather than units, and l_i = p_i denotes the matrix of outputs for layer i. For the backwards pass, i is the index of the layer closest to the output layer.

Weight updates for a convolutional kernel:

    Δw_ij = −a(δ_j ∗ l_i), where δ_j ∗ l_i = dE/dw_ij

For the weight updates, i is the index of the layer closest to the input layer.
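The convolution-theorem identity can be verified numerically. The sketch below substitutes a naive O(n²) DFT for the real FFT that clFFT would supply, and checks the Fourier route against a direct circular convolution; all names here are illustrative:

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <vector>

using cd = std::complex<double>;
const double PI = std::acos(-1.0);

// Naive discrete Fourier transform; sign = -1 forward, +1 inverse (unscaled).
std::vector<cd> dft(const std::vector<cd>& x, int sign) {
    std::size_t n = x.size();
    std::vector<cd> out(n);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t t = 0; t < n; ++t)
            out[k] += x[t] * std::polar(1.0, sign * 2.0 * PI * k * t / n);
    return out;
}

// Circular convolution via the convolution theorem: c ⊛ k = F⁻¹(F(c) · F(k)).
std::vector<double> conv_via_fourier(const std::vector<double>& c,
                                     const std::vector<double>& k) {
    std::size_t n = c.size();
    std::vector<cd> fc(c.begin(), c.end()), fk(k.begin(), k.end());
    fc = dft(fc, -1);
    fk = dft(fk, -1);
    std::vector<cd> prod(n);
    for (std::size_t i = 0; i < n; ++i) prod[i] = fc[i] * fk[i];  // elementwise product
    prod = dft(prod, +1);
    std::vector<double> out(n);
    for (std::size_t i = 0; i < n; ++i) out[i] = prod[i].real() / n;  // inverse scaling
    return out;
}

// Direct O(n²) circular convolution for comparison.
std::vector<double> conv_direct(const std::vector<double>& c,
                                const std::vector<double>& k) {
    int n = static_cast<int>(c.size());
    std::vector<double> out(n, 0.0);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            out[i] += c[j] * k[((i - j) % n + n) % n];
    return out;
}
```

With a true FFT in place of the naive DFT, the Fourier route drops the cost from O(n²) to O(n log n), which is the saving the fast convolution layer exploits.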
OpenCL learning resources and reference material
Having never worked with OpenCL before, I ended up working through a number of tutorials and example programs. Listed below are all the resources I used.

Resource type       Name                                                     Location
PDF, specification  OpenCL 2.0 specification                                 https://www.khronos.org/registry/cl/specs/opencl-2.0.pdf
Website, reference  clBLAS manual and reference                              http://clmathlibraries.github.io/clBLAS/
Website, reference  clFFT manual and reference                               http://clmathlibraries.github.io/clFFT/
Book                Heterogeneous Computing with OpenCL 2.0, by David Kaeli, Perhaad Mistry, Dana Schaa and Dong Ping Zhang    http://developer.amd.com/partners/university-programs/heterogeneous-computing-with-opencl/
Website, tutorial   Oak Ridge laboratory, OpenCL vector addition tutorial    https://www.olcf.ornl.gov/tutorials/opencl-vector-addition/
Website, tutorial   AMD, Intro to OpenCL tutorial                            http://developer.amd.com/tools-and-sdks/opencl-zone/opencl-resources/introductory-tutorial-to-opencl/

Figure 2.4.1. Learning resources
System Design
Development environment
The OpenCL specification is written against C++, which is consequently the language of choice for this project.
Windows was chosen as the development environment due to personal familiarity with the Visual Studio software package. Visual Studio 2015 is used to provide an up to date implementation of the C++11 specification. In keeping with the project objectives, Windows-specific code shall be restricted to the main.cpp file. All other code will be written with the standard template library in mind, and as such should compile under g++ and run on Linux.
Familiarisation with OpenCL showed that developing optimised kernels is difficult. Consequently, I decided to employ AMD's clBLAS library where possible; clBLAS provides a set of common basic linear algebra kernels. AMD also provides clFFT for computing fast Fourier transforms. clFFT was added as an additional dependency to assist in implementing fast convolution layers (Fig. 3.2.1).
Essential Requirements
1. A network class capable of:
a. Constructing multi layer feed forward neural networks. The programmer
should be able to easily specify the number of units within each layer.
b. Training neural networks. Training performance must be reported through
cross validation against test data.
c. Testing neural networks. A method must be implemented that returns the network's mean standard error across a batch of test data.
d. Processing inputs. A method must be implemented that allows the network to
accept a single set of inputs from the main program thread, returning the
corresponding output from the network.
2. A layer class that provides a logical ordering of network computational units.
3. An implementation of the back propagation training algorithm.
4. An implementation of the sigmoid activation function and its corresponding
differential.
5. A sample program capable of demonstrating network training and testing functionality
on different OpenCL devices.
6. Unit testing, testing trained Network accuracy by validating against a dataset
generated from a mathematical function.
Optional Requirements
1. Unit testing, testing trained Network accuracy by validating against a well known pre-constructed dataset.
2. Implementation of a convolutional layer and convolutional kernel classes. These must
provide:
a. Weight sharing across spatially separated neuron units.
b. Modification to the back propagation algorithm to handle shared weights.
3. An implementation of the linear rectifier activation function and its corresponding
differential.
4. Network regularization. Either through weight decay or dropout.
Implementation Deliverables
1. A Visual Studio 2015 C++ solution containing a working example of the developed deep
neural network library.
2. Headers and associated .cpp definitions with comments describing how the library works.
3. OpenCL kernel code.
4. clBLAS and clFFT included as dynamic link libraries.
Technical Challenges
Feeding the OpenCL device
OpenCL provides a high latency, high throughput bridge between the host device and the compute device. The host device and compute device share one or more queues: the host produces jobs and inserts them into a queue, and the compute device consumes job items from the queue. By default, OpenCL creates a serial queue, forcing the compute device to compute jobs in order. This is not ideal, as some jobs may take only a fraction of the compute device's resources. Setting CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE when creating the command queue will enable the device to consume jobs out of order.
Each job is associated with an event, which may be in one of four states: queued, submitted,
running, and complete. Jobs are also associated with event completion wait lists, allowing for
synchronization and dependency blocks. Ideally the work queue will be saturated, so that the
compute device can work on jobs continually.
Figure 3.1.1. A visualization of how the queue controls job consumption. The queue is saturated: there are
more jobs available for the compute device to consume, as shown by the line in red. The host device is
shown in green. Independent jobs are undertaken either in parallel or in an undetermined serial order.
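The wait-list semantics can be illustrated with a host-side simulation: a job becomes eligible only once every job in its wait list has completed, but eligible jobs may otherwise be consumed in any order. This is a conceptual sketch of the queue behaviour, not OpenCL API code:

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// A job becomes eligible only when every job in its wait list has completed;
// eligible jobs may otherwise complete in any order. Assumes no dependency
// cycles. This simulates the queue semantics; it is not OpenCL API code.
struct Job {
    int id;
    std::vector<int> waitList;  // ids of jobs that must complete first
};

std::vector<int> consumeOutOfOrder(std::vector<Job> queue) {
    std::set<int> done;
    std::vector<int> order;
    while (!queue.empty()) {
        for (std::size_t i = 0; i < queue.size(); ++i) {
            bool ready = true;
            for (int dep : queue[i].waitList)
                if (!done.count(dep)) { ready = false; break; }
            if (ready) {  // consume the first eligible job, regardless of position
                done.insert(queue[i].id);
                order.push_back(queue[i].id);
                queue.erase(queue.begin() + static_cast<std::ptrdiff_t>(i));
                break;
            }
        }
    }
    return order;
}
```

Here job 2 waits on job 3, so an out-of-order queue consumes job 3 before job 2 even though it was enqueued later.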
Behaviour is undefined if the compute device attempts to write to or read from cl_mem buffers
being modified by the host device. The reverse is also true. Consequently, the queue must be
utilised to stall both host and compute device until read/write operations are finished. The
number of read and write operations between the host and compute device should be minimised
in order to prevent stalls; as such, as much data as possible should be kept device-side.
Figure 3.1.2. A forced synchronization point. The host is attempting to read the cl_mem holding the network’s
output. A backward gather compute job is available, but cannot be consumed until the host has finished its
read.
OpenCL kernel efficiency considerations
OpenCL kernels are small programs that run on the OpenCL compute device. OpenCL kernels
are compiled using an OpenCL device context by the host at program start-up. The host can
then queue the kernel binary to the compute device as part of a compute task. Similarly, the
OpenCL host can queue read or write operations to modify or view the contents of cl_mem
buffers held in the compute device’s global cache.
Figure 3.1.3. A depiction of the hardware differences exposed by OpenCL. OpenCL devices typically have
access to a much larger number of threads. An AMD Fury X GPU has access to 4096 threads.
The specification is designed with massive parallelism in mind. An instance of a submitted kernel
program is launched for each thread in the global work group. The global work group is
subdivided into equally sized local work groups. Each thread has access to a small but very fast
private memory, and a slower but larger memory shared across its work group. All threads have
access to the global memory cache. Threads may only communicate within their work group.
Task division is primarily achieved using the thread’s unique id, which lies in the range
0 <= x < global work group size. Kernel jobs are only marked as complete once all their threads
have finished; as such, a kernel is only as fast as its slowest thread.
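As an illustration of id-based task division, the following maps a thread's unique id onto the matrix element it should process. Task and taskForThread are illustrative names, shown here as plain C++; in an actual OpenCL kernel the id would come from get_global_id(0):

```cpp
#include <cassert>

// Maps a thread's unique id onto the matrix element it should process.
// In an actual OpenCL kernel the id would come from get_global_id(0).
struct Task { int row; int col; };

Task taskForThread(int globalId, int matrixWidth) {
    Task t;
    t.row = globalId / matrixWidth;
    t.col = globalId % matrixWidth;
    return t;
}
```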
It is also worth noting that GPUs often implement reduced instruction sets. Consequently some
function calls can have large overheads. For example, the modulo operator is expensive on
AMD GPU hardware.
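As a concrete example of working around an expensive modulo, a power-of-two divisor lets the operation be replaced by a cheap bitwise AND:

```cpp
#include <cassert>

// x % d can cost far more than a bitwise AND on some GPUs; when d is a
// power of two the two are equivalent: x % 8 == x & 7.
unsigned fastMod8(unsigned x) { return x & 7u; }
```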
Using clFFT
The clFFT library is relatively complex, yet I could only find three example programs. I
subsequently created a small program to see if I could successfully transform a real-valued 2D
matrix into the complex frequency domain, then back again to the spatial domain. The test was
successful. See Appendix B.1 for the code and B.2 for results.
Implementation Schedule
For the original implementation schedule, refer to Appendix C.1. A modified schedule was
created at the end of December 2015, after the initial project proposal was recognized to be
too complex for the given time frame. See Appendix C.2. Originally I had hoped to demonstrate
basic speech recognition capabilities; however, this would require that convolution features be
fully implemented. Other commitments meant that I was unsure whether or not convolution layer
functionality could be implemented in time. Instead I decided that the implementation would
benefit from a greater focus on testing core multi-layer network functionality and performance.
Design specification
Designing a flexible network architecture
Rather than adding computational units directly into the Layer class, it was decided to wrap them
within a pool class. This gives the programmer more flexibility when defining network
architecture, as shown by Fig 3.2.1. This was an early design decision, the result of designing a
way in which convolution layers and standard unit layers could be integrated in a
complementary fashion, rather than forcing the programmer to choose between one or the
other. Layers enforce the sequence in which the forward and backward passes visit units.
Passes are performed in parallel for pools in the same layer. MatrixPools are pools of standard
units with biases. ConvPools are pools of convolutional units arranged into a 2D matrix.
ConvPool units share a single bias between them for each incoming convolutional kernel.
Figure 3.2.1. A network architecture example that might be used.
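The layer/pool arrangement described above might be sketched as follows; the class names mirror the text, but the real library's interfaces will differ:

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Illustrative names only; the real library's interfaces differ.
struct Pool {
    virtual ~Pool() {}
    virtual std::string kind() const = 0;
};
struct MatrixPool : Pool {  // standard units, one bias per unit
    std::string kind() const override { return "matrix"; }
};
struct ConvPool : Pool {  // convolutional units, one shared bias per kernel
    std::string kind() const override { return "conv"; }
};
struct Layer {
    // Pools in the same layer are processed in parallel during a pass;
    // layers enforce the order in which passes visit units.
    std::vector<std::unique_ptr<Pool>> pools;
};
```

Wrapping units in pools rather than placing them directly in the Layer lets a single layer mix standard and convolutional pools side by side.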
Validation tests
All training outputs are normalised into the range 0.0 to 1.0 such that they are compatible with
logistic units typically used by output layers. Linear rectifiers are not suitable for use in the
network output layer.
1. MNIST handwritten character recognition, 60,000 labelled training images, 10,000
labelled testing images [29]. Network input of 28x28 = 784 LU. Output of 10 SiLU, with
the index of the unit with the largest response corresponding to the digit’s classification.
Tests 2–4 generate random values a, b, c, d, e in the range 0.0 to 1.0.
2. sin(a), 1000 testing values, 200 training values. Network input of 1 LU. Output of 1 SiLU.
3. sort(a, b, c, d, e), sorting 5 parameters, 1000 testing values, 200 training values. Network
input of 5 LU. Output of 5 SiLU.
4. Polynomial, 3.0f*a*a + a + 7.0f*b + 1.0f, 1000 testing values, 200 training values. Network
input of 2 LU. Output of 1 SiLU.
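As an example of the normalisation described above, sin(a) has range -1 to 1 and can be brought into 0.0 to 1.0 with an affine map. normalisedSin is a hypothetical helper, not library code:

```cpp
#include <cassert>
#include <cmath>

// normalisedSin is a hypothetical helper: sin(a) has range -1..1, so an
// affine map brings it into 0.0..1.0 for the logistic output unit.
float normalisedSin(float a) { return (std::sin(a) + 1.0f) * 0.5f; }
```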
Class hierarchy
Figure 3.3.1. A UML diagram showing the basic relationship between network classes. Important field
members are shown. The Network class is intended to provide the primary interface used by the programmer.
Results
Requirement satisfaction
Refer to system design, essential and optional requirements, page 11.
1. a. Full compliance.
1. b. Full compliance.
1. c. Full compliance.
1. d. Full compliance.
2. Full compliance.
3. Full compliance.
4. Full compliance.
5. Full compliance.
6. Full compliance.
Optional Requirements
1. Full compliance, MNIST [29] handwritten digit dataset validation provided.
2. Partial compliance. clFFT tests completed. Interface and class structure for
convolution units and kernels added. No implementations currently present.
3. Full compliance. Linear rectifiers are used as the default activation function for hidden
layers.
4. No compliance. A test was conducted with weight decay, but it was not found to
increase network test validation accuracy. Consequently it was decided not to include the
weight modification change. Further testing is required.
Test validation Results
Table 4.1.1. Results from validation runs with varying epoch numbers. The initial learn rate for all tests was
0.001.
OpenCL Device | Validation test type | Training time (seconds) | Epochs | Network structure | Training sample selection | Training passes per epoch | Mean standard error | Classification error
AMD Fury X (8192 GFlops) | MNIST | 10.941 | 5 | Appendix A.1 | random | 2000 | 0.1198 | 0.1819
Intel i7-6700k (114 GFlops) | MNIST | 65.4442 | 5 | Appendix A.1 | random | 2000 | 0.1375 | 0.187
AMD Fury X (8192 GFlops) | MNIST | 21.6659 | 10 | Appendix A.1 | random | 2000 | 0.1167 | 0.1617
Intel i7-6700k (114 GFlops) | MNIST | 135.6973 | 10 | Appendix A.1 | random | 2000 | 0.1118 | 0.1614
AMD Fury X (8192 GFlops) | MNIST | 42.4356 | 20 | Appendix A.1 | random | 2000 | 0.1035 | 0.1509
Intel i7-6700k (114 GFlops) | MNIST | 262.8941 | 20 | Appendix A.1 | random | 2000 | 0.0873 | 0.1358
AMD Fury X (8192 GFlops) | Sin(x) | 9.3542 | 20 | Appendix A.2 | all | 800 | 0.0124 | N/A
Intel i7-6700k (114 GFlops) | Sin(x) | 19.0043 | 20 | Appendix A.2 | all | 800 | 0.0084 | N/A
AMD Fury X (8192 GFlops) | Sin(x) | 9.3163 | 20 | Appendix A.2 | all | 800 | 0.0334 | N/A
Intel i7-6700k (114 GFlops) | Sin(x) | 18.7893 | 20 | Appendix A.2 | all | 800 | 0.0293 | N/A
AMD Fury X (8192 GFlops) | Sin(x) | 9.2561 | 20 | Appendix A.2 | all | 800 | 0.1025 | N/A
Intel i7-6700k (114 GFlops) | Sin(x) | 18.0904 | 20 | Appendix A.2 | all | 800 | 0.0128 | N/A
AMD Fury X (8192 GFlops) | Sort(a, b, c, d, e) | 25.9588 | 20 | Appendix A.3 | all | 800 | 0.2039 | N/A
Intel i7-6700k (114 GFlops) | Sort(a, b, c, d, e) | 169.0974 | 20 | Appendix A.3 | all | 800 | 0.2178 | N/A
AMD Fury X (8192 GFlops) | Sort(a, b, c, d, e) | 25.6606 | 20 | Appendix A.3 | all | 800 | 0.191 | N/A
Intel i7-6700k (114 GFlops) | Sort(a, b, c, d, e) | 173.9321 | 20 | Appendix A.3 | all | 800 | 0.1462 | N/A
AMD Fury X (8192 GFlops) | Sort(a, b, c, d, e) | 25.6168 | 20 | Appendix A.3 | all | 800 | 0.1807 | N/A
Intel i7-6700k (114 GFlops) | Sort(a, b, c, d, e) | 169.9004 | 20 | Appendix A.3 | all | 800 | 0.1918 | N/A
AMD Fury X (8192 GFlops) | Polynomial | 17.789 | 20 | Appendix A.4 | all | 800 | 0.0209 | N/A
Intel i7-6700k (114 GFlops) | Polynomial | 90.2876 | 20 | Appendix A.4 | all | 800 | 0.0315 | N/A
AMD Fury X (8192 GFlops) | Polynomial | 17.508 | 20 | Appendix A.4 | all | 800 | 0.0185 | N/A
Intel i7-6700k (114 GFlops) | Polynomial | 90.7548 | 20 | Appendix A.4 | all | 800 | 0.0234 | N/A
AMD Fury X (8192 GFlops) | Polynomial | 17.5351 | 20 | Appendix A.4 | all | 800 | 0.0203 | N/A
Intel i7-6700k (114 GFlops) | Polynomial | 87.968 | 20 | Appendix A.4 | all | 800 | 0.0239 | N/A
AMD Fury X (8192 GFlops) | MNIST | 3183.586 | 200 | Appendix A.5 | random | 5000 | 0.027 | 0.0454
AMD Fury X (8192 GFlops) | MNIST | 164.564 | 10 | Appendix A.5 | random | 5000 | 0.0612 | 0.0956
MNIST classification examples
Figure 4.2.1. A randomly sampled 2 misclassified by the neural network as a 0.
Figure 4.2.2. A randomly sampled 2 that is correctly classified.
Figure 4.2.3. A randomly sampled 5 that is correctly classified.
Result Discussion
Taking the mean average ratio of i7-6700k run times over Fury X run times from table 4.1.1
gives a mean ratio of 4.97. This is low considering the Fury X has 8192 GFlops compared to
the i7-6700k’s 114, which would suggest a ratio closer to 72. It is possible that the task queue is
not saturated and that the OpenCL device is idling for a number of cycles, which would suggest
the main thread is causing throttling. Alternatively, it is possible that an OpenCL kernel is causing
a bottleneck due to poor optimisation. Further investigation is required.
Overall performance is acceptable on the Fury X, but has some way to go before it is
comparable with popular public libraries. The 10 epoch Fury X test with a 5000
sample rate completed training in 165 seconds, and had 1,099,770 trainable parameters. A
similar network was set up in Python using Theano, via Lasagne, to provide a
reference. The Theano network had 945,768 parameters, and achieved a training time of 44
seconds on an i7-6700k over 10 epochs. Final accuracy was relatively similar: my OpenCL
implementation achieved a misclassification rate of 10%, while Theano achieved an error of 8%.
Recognition rate was good, taking 15.5 seconds to recognise all 10,000 MNIST test images,
giving an image-per-second rate of 645. Multiplying out by the size of the input, 28x28 = 784, this
gives a total rate of 505,680 inputs processed per second. Caffe’s OpenCL branch is
approximately 90x faster at processing inputs, and significantly faster at training, though it is
worth noting that batching is used for the Caffe test results published on GitHub.
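The throughput figures quoted above follow from simple arithmetic:

```cpp
#include <cassert>

// 10,000 images in 15.5 seconds is 645 images per second; each image is
// 28x28 = 784 inputs, giving 505,680 inputs processed per second.
const int imagesPerSecond = static_cast<int>(10000.0 / 15.5);
const int inputsPerSecond = imagesPerSecond * 28 * 28;
```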
Training times on the i7-6700k could be quite long with my OpenCL implementation. For example,
the 10 epoch MNIST test with a 2000 sample rate took 136 seconds, despite the network having
only 218,842 trainable parameters.
A longer training session was undertaken using the network described in Appendix A.5,
achieving a good final error rate of 4.5%, the same as that achieved by a two-layer neural
network in a popular publication on document recognition [29][30]. The network also proved
accurate over the modelled mathematical functions sin(x), sort(a, b, c, d, e) and the polynomial
function, achieving best respective errors of 8.4%, 15% and 19%.
Evaluation
Further Work
1. Debugging performance issues.
2. Finishing integration of optional requirements.
3. Possibly worth investigating the removal of the majority of queue jobs by calling kernels
from the device. OpenCL 2.0 allows compute devices to make kernel calls. This feature
was not explored, as it adds significant design complexity: clBLAS would have to be
modified to handle custom kernel post/pre callbacks. clFFT supports this feature.
Conclusion
Considering the complexity of the project, I believe the outcome to be reasonable. A cross-
platform deep learning library was developed in C++ and demonstrated to work successfully on
a range of tasks. Though performance was not ideal, I am confident the bottlenecks could be
identified by isolating the execution times of the called OpenCL kernels.
Deployment guide
Hardware requirements:
OpenCL 2.0 compatible device
x64 Windows environment (tested on Windows 7, 8, and 10)
Software requirements:
AMD APP SDK 3.0 or greater
Building from source requires Visual Studio 2015 or newer
1. Proceed to http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-
parallel-processing-app-sdk/.
2. Download and install AMD APP SDK 3.0 for Windows 64-bit.
3. Unzip Code_Base.zip
Running the binary:
4. Proceed to the “./Backpropagation/Bin” folder
5. Run Backpropagation.exe
Compiling from source:
4. Proceed to the “./Backpropagation/Backpropagation” folder
5. Open Visual Studio 2015
6. Click File -> Open Project/Solution
7. Open Backpropagation.sln
8. Press Ctrl + F5 to compile and run
Bibliography
[1] Sainath, Tara N., Abdel-rahman Mohamed, Brian Kingsbury, and Bhuvana Ramabhadran. "Deep
convolutional neural networks for LVCSR." In Acoustics, Speech and Signal Processing (ICASSP), 2013
IEEE International Conference on, pp. 8614-8618. IEEE, 2013.
[2] https://research.facebook.com/blog/fair-open-sources-deep-learning-modules-for-torch/
[3] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche,
Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman,
Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray
Kavukcuoglu, Thore Graepel, and Demis Hassabis. "Mastering the Game of Go with Deep Neural
Networks and Tree Search." Nature 529, no. 7587 (2016): 484.
[4] Linnainmaa, Seppo. "The representation of the cumulative rounding error of an algorithm as a Taylor
expansion of the local rounding errors." Master's Thesis (in Finnish), Univ. Helsinki (1970): 6-7.
[5] Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. Learning internal representations by
error propagation. No. ICS-8506. CALIFORNIA UNIV SAN DIEGO LA JOLLA INST FOR COGNITIVE
SCIENCE, 1985.
[6] Rumelhart, D.E., Hinton, G.E. and Williams, R.J., 1988. Learning representations by back-propagating
errors. Cognitive modeling, 5(3), p.714.
[7] Mathieu, Michael, Mikael Henaff, and Yann LeCun. "Fast training of convolutional networks through
FFTs." arXiv preprint arXiv:1312.5851 (2013).
[8] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R., 2014. Dropout: A simple
way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1),
pp.1929-1958.
[9] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural Computation 9, no. 8
(1997): 1735-1780.
[10] Martínez-Zarzuela, Mario, Francisco Javier Díaz Pernas, José Fernando Díez Higuera, and Míriam
Antón Rodríguez. "Fuzzy ART neural network parallel computing on the GPU." In Computational and
Ambient Intelligence, pp. 463-470. Springer Berlin Heidelberg, 2007.
[11] (Shader model 5 for DirectX), accessed 21/ 05/ 2016,
https://www.google.co.uk/search?q=shader+model+5&oq=shader+model+5&aqs=chrome..69i57.3354j0j7
&sourceid=chrome&ie=UTF-8
[12] John Kessenich, Dave Baldwin, Randi Rost, “The OpenGL Shader language”,
https://www.opengl.org/registry/doc/GLSLangSpec.4.50.pdf
[13] http://www.nvidia.co.uk/object/cuda-parallel-computing-uk.html, accessed 22/05/2016
[14] https://www.khronos.org/opencl/, accessed 22/05/2016
[15] http://developer.amd.com/tools-and-sdks/opencl-zone/, accessed 22/05/2016
[16] https://software.intel.com/en-us/intel-
opencl?cid=sem43700008896000156&intel_term=intel+openCL&gclid=CjwKEAjwsYW6BRCTzvu5y8DP
hi0SJABnGLlHWfkJo5tNdbBubNlnsqdz_nyHUSfm6SPPlECfXbtAgxoCSvXw_wcB&gclsrc=aw.ds,
accessed 22/05/2016
[17] https://developer.nvidia.com/gpu-accelerated-libraries, accessed 22/05/2016
[18] https://developer.nvidia.com/cuda-gpus, accessed 22/05/2016
[19] https://www.khronos.org/conformance/adopters/conformant-products#opencl, accessed 22/05/2016
[20] https://github.com/amd/OpenCL-caffe/wiki/How-to-set-up-clBLAS-and-OpenCL, accessed 22/05/2016
[21] https://github.com/amd/OpenCL-caffe, accessed 22/05/2016
[22] Krizhevsky, A., Sutskever, I. and Hinton, G.E., 2012. Imagenet classification with deep convolutional
neural networks. In Advances in neural information processing systems (pp. 1097-1105).
[23] Kulkarni, Sanjeev, and Harman, Gilbert. "Multilayer Networks." In Wiley Series in Probability and
Statistics, 99-115. Hoboken, NJ, USA: John Wiley & Sons, 2011.
[24] Rosenblatt, Frank. "The perceptron: a probabilistic model for information storage and organization in
the brain." Psychological review 65, no. 6 (1958): 386.
[25] Narsky, Ilya, and Porter, Frank C. "Neural Networks." In Statistical Analysis Techniques in Particle
Physics, 251-63. Weinheim, Germany: Wiley‐VCH Verlag GmbH & KGaA, 2013. Chapter 12.
[26] Glorot, Xavier, Antoine Bordes, and Yoshua Bengio. "Deep sparse rectifier neural networks."
In International Conference on Artificial Intelligence and Statistics, pp. 315-323. 2011.
[27] Simard, P.Y., Steinkraus, D. and Platt, J.C., 2003, August. Best practices for convolutional neural
networks applied to visual document analysis. In Seventh International Conference on Document Analysis and Recognition (p. 958). IEEE.
[28] Mathieu, M., Henaff, M. and LeCun, Y., 2013. Fast training of convolutional networks through
FFTs. arXiv preprint arXiv:1312.5851.
[29] http://yann.lecun.com/exdb/mnist/, accessed 22/05/2016
[30] LeCun, Yann, Léon Bottou, Yoshua Bengio, and Patrick Haffner. "Gradient-based learning applied to document
recognition." Proceedings of the IEEE 86, no. 11 (1998): 2278-2324.
Appendices
A - Network validation architectures
A.1. MNIST
Trainable parameters 218,842
A.2. sin(a)
Trainable parameters 387
A.3. sort(a, b, c, d, e)
Trainable parameters 36,259
A.4. polynomial
Trainable parameters 17,285
A.5. MNIST
Trainable parameters 1,099,770
B – clFFT library experiment
B.1. Fourier transform and inverse Fourier transform via clFFT and OpenCL
/* ************************************************************************
* Copyright 2013 Advanced Micro Devices, Inc.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
* ************************************************************************/
/* ************************************************************************
* Copyright Callum McMahon
*
* Added inverse hermitian transform, showing how data can
* be transformed back to the spatial domain.
* Terminal outputs after the inverse should match the original dataset.
* ************************************************************************/
/* No need to explicitly include the OpenCL headers */
size_t buffer_size_y = ((N0+2) * N1) * sizeof(*Y);
X = (float *)malloc(buffer_size_x);
Y = (float *)malloc(buffer_size_y);
/* print input array just using the
* indices to fill the array with data */
printf("\nPerforming fft on an two dimensional array of size N0 x N1 : %ld x %ld\n", N0, N1);
int i, j;
i = j = 0;
for (i = 0; i<N0; ++i) {
for (j = 0; j<N1; ++j) {
float x = 0.5f;
float y = 0.5f;
unsigned idx = (j + i*N0);
X[idx] = sin(1.0f*(float)i) + cos(0.4f*(float)j);
printf("\n(%f) ", X[idx]);
}
printf("\n");
}
/* Prepare OpenCL memory objects and place data inside them. */
bufX = clCreateBuffer(ctx, CL_MEM_READ_WRITE, buffer_size_x, NULL, &err);
//CL_MEM_READ_ONLY
bufY = clCreateBuffer(ctx, CL_MEM_READ_WRITE, buffer_size_y, NULL, &err);
err = clEnqueueWriteBuffer(queue, bufX, CL_TRUE, 0, buffer_size_x, X, 0, NULL,
NULL);
/* Create a default plan for a complex FFT. */
err = clfftCreateDefaultPlan(&planHandle, ctx, dim, clLengths);
/* Set plan parameters. */
err = clfftSetPlanPrecision(planHandle, CLFFT_SINGLE);
err = clfftSetLayout(planHandle, CLFFT_REAL, CLFFT_HERMITIAN_INTERLEAVED);
err = clfftSetResultLocation(planHandle, CLFFT_OUTOFPLACE);
err = clfftSetPlanOutStride(planHandle, dim, clOutStrides);
err = clfftSetPlanInStride(planHandle, dim, clInStrides);
/* Bake the plan. */
err = clfftBakePlan(planHandle, 1, &queue, NULL, NULL);
/* Execute the plan. */
err = clfftEnqueueTransform(planHandle, CLFFT_FORWARD, 1, &queue, 0, NULL, NULL,
&bufX, &bufY, NULL);
/* Wait for calculations to be finished. */
err = clFinish(queue);
/* Fetch results of calculations. */
err = clEnqueueReadBuffer(queue, bufY, CL_TRUE, 0, buffer_size_y, Y, 0, NULL,
NULL);
/* print output array */
printf("\n\nfft result: \n");
i = j = 0;
for (i = 0; i<N0; ++i) {
for (j = 0; j<fac; ++j) {
unsigned idx = 2 * (j + i*fac);
printf("\n(%f) ", sqrt(Y[idx] * Y[idx] + Y[idx+1] * Y[idx+1]));
//fiddle with results to test
//Y[idx] += 0.01f*(float)idx;
}
printf("\n");
}
printf("\n");
//*****************
//reverse!
//*****************
printf("\n\n *** reverse ***\n\n");
//clOutStrides[0] = { 1, fac };
//clInStrides[0] = { 1, N0 };
err = clEnqueueWriteBuffer(queue, bufY, CL_TRUE, 0, buffer_size_y, Y, 0, NULL,
NULL);
/* Create a default plan for a complex FFT. */
err = clfftCreateDefaultPlan(&planHandle, ctx, dim, clLengths);
/* Set plan parameters. */
err = clfftSetPlanPrecision(planHandle, CLFFT_SINGLE);
err = clfftSetLayout(planHandle, CLFFT_HERMITIAN_INTERLEAVED, CLFFT_REAL);
err = clfftSetResultLocation(planHandle, CLFFT_OUTOFPLACE);
err = clfftSetPlanOutStride(planHandle, dim, clInStrides);
err = clfftSetPlanInStride(planHandle, dim, clOutStrides);
/* Bake the plan. */
err = clfftBakePlan(planHandle, 1, &queue, NULL, NULL);
/* Execute the plan. */
err = clfftEnqueueTransform(planHandle, CLFFT_FORWARD, 1, &queue, 0, NULL, NULL,
&bufY, &bufX, NULL);
/* Wait for calculations to be finished. */
err = clFinish(queue);
/* Fetch results of calculations. */
err = clEnqueueReadBuffer(queue, bufX, CL_TRUE, 0, buffer_size_x, X, 0, NULL,
NULL);
i = j = 0;
for (i = 0; i<N0; ++i) {
for (j = 0; j<N1; ++j) {
float x = 0.5f;
float y = 0.5f;
unsigned idx = (j + i*N0);
printf("\n(%f) ", X[idx]);
}
printf("\n");
}
//*****************
//reverse END
//*****************
/* Release OpenCL memory objects. */
clReleaseMemObject(bufX);
free(X);
clReleaseMemObject(bufY);
free(Y);
/* Release the plan. */
err = clfftDestroyPlan(&planHandle);
/* Release clFFT library. */
clfftTeardown();
/* Release OpenCL working objects. */
clReleaseCommandQueue(queue);
clReleaseContext(ctx);
getchar();
return ret;
}
B.2. Program outputs from B.1, showing only the first column for succinctness.
Platform found: Intel(R) OpenCL
Device found on the above platform: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Performing fft on an two dimensional array of size N0 x N1 : 8 x 8
(1.000000)
(0.921061)
(0.696707)
(0.362358)
(-0.029200)
(-0.416147)
(-0.737394)
(-0.942222)
fft result:
(11.271166)
(27.725875)
(11.865518)
(8.765699)
(8.040510)
*** reverse ***
(1.000000)
(0.921061)
(0.696707)
(0.362358)
(-0.029200)
(-0.416147)
(-0.737394)
(-0.942222)
C – Gantt time plans
C.1. Original Gantt time plan
C.2. Modified Gantt time plan