Team 6 is comprised of 5 members: Sourabh Ketkale, Sahil Kaw, Siddhi Pai, Goutham Nekkalapu, and Prince Jacob Chandy. The document discusses several techniques for optimizing neural network performance on different hardware, including using 8-bit quantization, SSE3 and SSE4 instruction sets, batching, lazy evaluation, batched lazy evaluation, and implementing neural networks on the Xeon Phi processor using techniques such as data parallelism and task parallelism. It also discusses using FPGAs and distributed systems to achieve large-scale deep learning.
1. Team 6:
Sourabh Ketkale : 010470785
Sahil Kaw : 010725104
Siddhi Pai : 010702458
Goutham Nekkalapu : 010815233
Prince Jacob Chandy : 010807225
2.
3.
4. Comparison to an optimized BLAS package: For larger matrices, the BLAS package
achieved a higher speedup over the baseline CPU implementation.
Comparison to an optimized GPU implementation: Without batching, the GPU
attained a 2.8x speedup over the baseline CPU.
5. Linear Quantization: We use 8-bit quantization to convert activations to
unsigned char and weights to signed char, with biases encoded as 32-bit
integers.
Intel SSE3: We achieve a 3x speedup because the supplemental extension (SSSE3)
provides the pmaddubsw instruction.
Intel SSE4: This instruction set speeds up the conversion from 16-bit to 32-bit
values, giving a 9% relative speed improvement over the SSE3 benchmark.
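The quantization scheme on this slide can be sketched in plain Python. This is a minimal illustration, not the implementation from the referenced work: the function names, the scale factors (255 for activations assumed to lie in [0, 1], 127 for weights assumed to lie in [-1, 1]), and the rescaling step are all assumptions made for the example.

```python
def quantize_unsigned(values, scale):
    """Map floats to unsigned 8-bit integers, clamped to 0..255."""
    return [min(255, max(0, round(v * scale))) for v in values]


def quantize_signed(values, scale):
    """Map floats to signed 8-bit integers, clamped to -128..127."""
    return [min(127, max(-128, round(v * scale))) for v in values]


def quantized_dot(acts_u8, weights_s8, bias_i32):
    """Integer multiply-accumulate, starting from a 32-bit bias."""
    acc = bias_i32
    for a, w in zip(acts_u8, weights_s8):
        acc += a * w
    return acc


def quantized_neuron(acts, weights, bias, act_scale=255.0, weight_scale=127.0):
    """Quantize the inputs, accumulate in integers, then rescale to float.

    The scale factors are illustrative assumptions, not values from the slides.
    """
    a_q = quantize_unsigned(acts, act_scale)
    w_q = quantize_signed(weights, weight_scale)
    # The bias lives at the product scale so it can be added to the accumulator.
    b_q = round(bias * act_scale * weight_scale)
    acc = quantized_dot(a_q, w_q, b_q)
    return acc / (act_scale * weight_scale)
```

On SIMD hardware the inner multiply-accumulate is what an instruction such as pmaddubsw performs on many unsigned-by-signed byte pairs at once; the Python loop only shows the arithmetic.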
6. BATCHING: With batching, the CPU can improve further relative to the GPU:
processing neural network inputs in bulk takes advantage of CPU caching of
both weights and activations.
LAZY EVALUATION: Only a fraction of the network's output states needs to be
computed at any point, so we can reduce the number of parameters that must be
visited, and thereby the number of arithmetic and memory operations, using the
Gaussian Selection technique.
BATCHED LAZY EVALUATION: Applying lazy evaluation to smaller batches in speech
evaluation improves the performance of the CPU relative to the GPU.
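Lazy and batched lazy evaluation can be sketched as follows. This is a hedged illustration: the function names and the choice to evaluate the union of needed output units across a batch are assumptions for the example, and how the needed units are selected is left to the caller.

```python
def lazy_layer(weights, bias, x, needed):
    """Evaluate only the requested output units of a dense layer.

    weights[j] is the weight row for output unit j; `needed` is an
    iterable of unit indices supplied by the caller.
    """
    return {j: bias[j] + sum(w * v for w, v in zip(weights[j], x))
            for j in needed}


def batched_lazy_layer(weights, bias, batch, needed_per_frame):
    """Evaluate the union of needed units for every frame in a batch.

    Looping over weight rows on the outside means each row is read once
    per batch and reused for all frames, which is the caching benefit
    batching is meant to provide.
    """
    union = sorted(set().union(*needed_per_frame))
    outputs = [dict() for _ in batch]
    for j in union:
        row, b = weights[j], bias[j]
        for out, x in zip(outputs, batch):
            out[j] = b + sum(w * v for w, v in zip(row, x))
    return outputs
```

Units outside the union are never touched, so their weight rows cost neither arithmetic nor memory traffic.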
7. An autoencoder is an artificial neural network used for learning efficient codings.
A stacked autoencoder is a deep learning model consisting of multiple
autoencoders.
The Xeon Phi behaves like a small cluster of 60 cores, each with 4 hardware threads. It has
8 GB of memory, a file system, the Linux operating system, and a clock speed of
1 GHz. Each core has a 32 KB L1 data cache and a 512 KB L2 cache.
8. Thread oversubscription means that the number of threads running in parallel exceeds
the number of hardware threads the Xeon Phi supports.
It greatly decreases the performance of the Xeon Phi because it leads to context switching,
which is very expensive on a many-core processor.
Solution:
The MapReduce method can effectively determine the number of threads required by
the MKL (Math Kernel Library) functions.
MKL can also determine the number of threads required by a process itself, but this is
not suited to model parallelism and asynchronous training.
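The oversubscription condition can be made concrete with a small thread-budget sketch. It assumes the 60-core, 4-threads-per-core figures quoted earlier in the slides; the function names and the even-split policy are hypothetical and are not the MapReduce-based method the slide refers to.

```python
def hardware_threads(cores=60, threads_per_core=4):
    """Total hardware threads available on the coprocessor."""
    return cores * threads_per_core


def threads_per_worker(num_workers, cores=60, threads_per_core=4):
    """Split the hardware threads evenly across workers (assumed policy)."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    return max(1, hardware_threads(cores, threads_per_core) // num_workers)


def is_oversubscribed(num_workers, threads_each, cores=60, threads_per_core=4):
    """True when the requested threads exceed the hardware threads."""
    return num_workers * threads_each > hardware_threads(cores, threads_per_core)
```

For example, four workers each get 60 threads under this policy, and asking for 61 threads per worker would oversubscribe the chip.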
9. Basic Design of Xeon Phi:
Training datasets for neural networks are very large, so a lot of I/O takes place between host RAM
and the coprocessor's memory, and this transfer time also needs consideration.
To address this, all parameters and temporary variables are kept resident in the
global memory of the Xeon Phi, and only the training data is transferred.
Parallel Design:
Data Parallelism: achieved by the Vector Processing Unit, which performs the data-wise operations
in each model replica.
Task Parallelism: achieved by the multiple threads on the Xeon Phi.
Affinity Mode: affinity sets up the mapping between threads and cores.
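A thread-to-core mapping can be illustrated with two simple placement policies. The names `compact` and `scatter` are borrowed loosely from common OpenMP affinity terminology; the formulas below are a simplified assumption for illustration and do not reproduce any runtime's exact placement.

```python
def compact(thread_id, threads_per_core=4):
    """Fill all hardware threads of one core before moving to the next."""
    return thread_id // threads_per_core


def scatter(thread_id, cores=60):
    """Spread consecutive threads round-robin across the cores."""
    return thread_id % cores
```

Compact placement keeps neighbouring threads on a shared core and its caches, while scatter placement gives each thread a core of its own until all cores are in use.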
12. Computing at this scale cannot depend on a single system;
it requires large-scale distributed systems.
13. There are multiple model replicas, each consisting of multiple machines, that train on
different subsets of the data and publish updates to the global model parameter server.
Model Parallelism
Data Parallelism
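The replica and parameter-server arrangement can be sketched with a toy, single-process simulation. Everything here is an assumption made for illustration: the `ParameterServer` class, the one-parameter least-squares model y = w * x, and the sequential loop that stands in for replicas running on separate machines.

```python
class ParameterServer:
    """Holds the global parameters and applies pushed gradients."""

    def __init__(self, params, learning_rate):
        self.params = list(params)
        self.learning_rate = learning_rate

    def pull(self):
        """Return a copy of the current global parameters."""
        return list(self.params)

    def push(self, grads):
        """Apply one gradient step from a replica."""
        for i, g in enumerate(grads):
            self.params[i] -= self.learning_rate * g


def replica_gradient(params, shard):
    """Mean gradient of 0.5 * (w * x - y)^2 over one replica's data shard."""
    w = params[0]
    return [sum((w * x - y) * x for x, y in shard) / len(shard)]


def train(server, shards, steps):
    """Each replica pulls parameters, computes on its shard, pushes an update."""
    for _ in range(steps):
        for shard in shards:
            params = server.pull()
            server.push(replica_gradient(params, shard))
    return server.params
```

Each shard plays the role of one replica's subset of the data (data parallelism); splitting a single replica's model across machines (model parallelism) is not shown.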
14. Whole system co-design
Model partitioning – working set of the model is stored in L3 cache
Local weight computation at the parameter server
Exploiting Asynchrony (as weight updates are commutative and associative)
Multi-threaded weight updates without locks
Asynchronous batch updates – aggregate weight updates locally and send them to the
parameter server only once the aggregate is large enough
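The asynchronous batch update idea can be sketched as a local buffer that sums updates and forwards them only after a threshold is reached. The `BufferedUpdater` class, its `push` callback, and the count-based threshold are hypothetical names and choices for this example.

```python
class BufferedUpdater:
    """Aggregates weight updates locally before sending them onward."""

    def __init__(self, push, size, threshold):
        self.push = push              # callable that delivers an aggregate
        self.buffer = [0.0] * size    # running sum of pending updates
        self.pending = 0              # number of updates in the buffer
        self.threshold = threshold    # flush once this many are pending

    def add(self, delta):
        """Accumulate one update; flush when enough have been gathered."""
        for i, d in enumerate(delta):
            self.buffer[i] += d
        self.pending += 1
        if self.pending >= self.threshold:
            self.flush()

    def flush(self):
        """Send the aggregated update and reset the buffer."""
        if self.pending == 0:
            return
        self.push(list(self.buffer))
        self.buffer = [0.0] * len(self.buffer)
        self.pending = 0
```

Because weight updates are commutative and associative, applying the summed update gives the same result as applying each update individually, in any order.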
15. To achieve this, GeePS needs to overcome the challenges of limited GPU memory,
inter-machine communication (data movement overheads), and GPU stalls.
A parameter server works by separating the problem of processing data from the
problem of communicating and synchronizing it between different machines.
GeePS is a parameter server supporting data-parallel model training.
16. The authors tried using an existing state-of-the-art parameter server system (IterStore)
with GPU based ML…
To enable a parameter server to support parallel ML applications running on distributed
GPUs the authors make three important changes:
Explicit use of GPU memory for the parameter cache
Batch-based parameter access methods
Parameter server management of GPU memory on behalf of the application
17. GPUs using a CPU-based parameter server
GPU based parameter server
18.
19. Two ways to achieve parallelism:
• By distributing deep computation into a Hadoop cluster or cloud of computing nodes
• By using field programmable gate arrays (FPGA) hardware acceleration to speed up
computationally intensive deep learning Kernels
20.
21. Performance bottlenecks in deep learning of CNNs
Design of distributed Hadoop clusters that separate the kernels processed on standard
nodes from those processed on accelerated FPGA-based nodes
Design and synthesis of the reconfigurable architecture to support kernel
acceleration on
Designing an interface library to achieve compatibility between FPGA nodes and
general-purpose nodes
22. Kernel Identification
Approach to Distributed Algorithm With FPGA-Based Nodes
Design and Implementation of Reconfigurable Architecture for Deep Learning Kernels
Seamless Integration of the Distributed Algorithm with the Accelerated Kernels
23.
24. To capitalize on the fine-grained parallelism that reconfigurable hardware makes
possible, which cannot be achieved with GPUs
The performance-per-watt ratio is better with FPGAs, which deliver computational
power at lower energy consumption in power-sensitive environments such as mobile
devices and data centers
Support with all the open source framework for the
25.
26.
27. A set of programming languages, models and tools
supporting the Intel x86 architecture can also be used
on the Intel Xeon Phi coprocessor with little change.
As a result, there is no need to redesign algorithms or
models for the GPU in CUDA or OpenCL.
Vector-intensive algorithms can take advantage of
this architecture.
28.
29.
30. OpenMP and the Intel MKL (Math Kernel Library)
packages are used to parallelize them.
The many matrix multiplications are handled by
the Intel MKL packages.
31. Achieves a 302-fold speedup compared with the
unoptimized sequential algorithm
33. Thread parallelism
Controlled Hogwild
Arbitrary Order of Synchronization
Vectorization
34. Speedup of the algorithm compared to one
thread on the Xeon Phi and to the sequential
version executed on the Xeon E5
Execution times for all thread counts and CNN
architecture sizes on the Xeon Phi, and for the
sequential version on the Xeon E5
35.
36.
37.
38. Implements deep learning on low-cost platforms.
The low-cost device adopts a task-flexible architecture and
multiple forms of parallelism to cover the functions of a CDBN.
39. complex function
an additional stage
random number generation
Additional tradeoff
Arithmetic Precision
Hardware Parallelism
Memory Input output bandwidth
Random number generator
40. By implementing 3 key features
Deep network learning engine with dual threaded 4 stage task level pipeline.
Deep network inference engine with dynamically reconfigurable systolic PE array.
True Random number generator.
41. High computational throughput and memory bandwidth
Implementing and optimizing the 1D, 2D and multi-channel 2D convolution operations
on GPU and Intel MIC
Hence, we target many-core architectures.
42.
43. For 1D and 2D : Register tiling.
For Multi-channel 2D convolution: Local Memory tiling.
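Register tiling for the 1D case can be sketched as computing a small tile of outputs together, so that each filter tap is loaded once per tile instead of once per output. This is an illustrative Python sketch under stated assumptions: the function names and tile size are hypothetical, the `acc` list merely stands in for hardware registers, and the convolution is the valid-mode, unflipped (cross-correlation) form.

```python
def conv1d_reference(signal, kernel):
    """Straightforward valid-mode 1D convolution (no kernel flip)."""
    n = len(signal) - len(kernel) + 1
    return [sum(signal[i + k] * kernel[k] for k in range(len(kernel)))
            for i in range(n)]


def conv1d_tiled(signal, kernel, tile=4):
    """Same result, computed one tile of outputs at a time."""
    n = len(signal) - len(kernel) + 1
    out = [0] * n
    for start in range(0, n, tile):
        width = min(tile, n - start)
        acc = [0] * width                 # stand-in for a block of registers
        for k, tap in enumerate(kernel):  # each tap is read once per tile
            for t in range(width):
                acc[t] += signal[start + t + k] * tap
        out[start:start + width] = acc
    return out
```

Local memory tiling for the multi-channel 2D case follows the same idea at a larger granularity, staging a block of the input in fast on-chip memory rather than in registers.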
44. On Intel MIC, our solution gets up to 25% of the theoretical
peak performance.
45. Because deep learning algorithms are computationally intensive, the choice of
framework and hardware depends on the use case
GPU:
Pro: They provide huge computational power
Can be used as a cluster of GPUs
But: huge power consumption, and algorithms have to be redesigned and reimplemented
in CUDA/OpenCL
FPGAs:
Pro: Low power consumption compared to GPUs
But: designing an algorithm for them can be time consuming
A potential speedup of 12.6 times and an energy reduction of 87.5% on a 6-node
FPGA-accelerated Hadoop cluster
46. Xeon Phi coprocessor:
Pro: Offers a considerable amount of computational power, and migrating to this platform
from a regular CPU is very easy. Performance can be improved further by combining it with
the Hadoop MapReduce method
But: running huge datasets requires a higher-end processor
x86 CPU: Performance can be improved by fixed-point implementation, batching and lazy
evaluation.