A CGRA-based Approach
for Accelerating
Convolutional Neural Networks
Masakazu Tanomoto, Shinya Takamaeda-Yamazaki,
Jun Yao, and Yasuhiko Nakashima
Nara Institute of Science and Technology (NAIST), Japan
E-mail: shinya_at_is_naist_jp
IEEE MCSoC'15 @Torino
September 23, 2015
Outline
n  Motivation: Deep learning on embedded computers
l  Target: Convolutional Neural Network (CNN)
n  Our approach: CGRA-based CNN acceleration
l  EMAX (Energy-aware Multi-mode Accelerator eXtension)
l  Mapping CNN on EMAX
n  Evaluation
l  Performance per memory bandwidth
l  Performance per area
n  Conclusion
MCSoC15 Shinya T-Y, NAIST 2
Deep learning
n  Recognition (Convolutional Neural Network (CNN))
l  Extracting high-level features automatically from raw data
l  Ex) Image, speech, and text recognition, image search
n  Reinforcement learning (Deep Q-Network (DQN))
l  Learning appropriate strategy for controlling something
l  Ex) Gaming AI, Robot control
MCSoC15 Shinya T-Y, NAIST 3
Playing Atari 2600 games automatically
(Human-level control through deep
reinforcement learning [Nature'15])
Extracted features of human and cat
(Building High-level Features Using Large
Scale Unsupervised Learning [ICML'12])
Convolutional Neural Network (CNN)
n  Nesting multiple processing layers
l  Convolution: Multiple small matrix-matrix multiplications
•  Each weight matrix corresponds to a learned feature map
•  Features are learned automatically by error back-propagation
l  Pooling and Max-out: selection from multiple values
l  Full connection: Large matrix-matrix multiplication
n  Performance Bottleneck: Convolution
l  Numerous small matrix-matrix multiplication with stencil
MCSoC15 Shinya T-Y, NAIST 4
Input Layer Hidden Layers Output Layer
Convolution Pooling Max Out Convolution Full Connection
Motivation: DNN on embedded computers
n  Machine learning on IoT: Learning and decision on edge
computers will become more important
l  Sending all data to data centers?: Network traffic problemL
l  Decision on data centers?: Very long latencyL
n  Challenge: Energy efficient embedded accelerators
l  Why not GPU?: GPU is very energy hungry and requires
absolute energy
•  Not only energy efficiency, but also absolute peak energy amount is
important
l  Why not ASIC?: Limited capability of algorithm customization
•  Algorithms of machine learning are rapidly evolving
l  Why not FPGA?: Energy overhead to building computing logics
l  CGRA?
MCSoC15 Shinya T-Y, NAIST 5
Computation pattern: Full connection
n  Output vector is determined by a simple vector-matrix
multiplication
l  Input and output size is certainly large: more than 1024
l  Weight matrix size is also large
n  GPU is OK: suitable for matrix multiplication
l  GPU has matrix libraries: CUBLAS, ...
MCSoC15 Shinya T-Y, NAIST 6
[Figure: Output Vector = Weight matrix dot Input Vector]
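As a point of reference, a fully connected layer is just this vector-matrix product; below is a minimal C sketch (not taken from the slides; names are illustrative):

/* Fully connected layer: out = Weight x in */
void full_connection(float *out, const float *weight,
                     const float *in, int out_dim, int in_dim)
{
    for (int i = 0; i < out_dim; i++) {
        float acc = 0.0f;
        for (int j = 0; j < in_dim; j++)
            acc += weight[i * in_dim + j] * in[j];   /* row i of Weight dot input vector */
        out[i] = acc;
    }
}

With in_dim and out_dim above 1024, this maps onto a single large GEMV/GEMM call, which is exactly the case GPU BLAS libraries handle well.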
Computation pattern: Convolution
n  A value of the result matrix is calculated by numerous
matrix-matrix multiplication with a small weight matrix
l  Weight matrix size is usually small: from 3 to 8
n  I know GPU is very fast for matrix-matrix multiplication
l  Really?
MCSoC15 Shinya T-Y, NAIST 7
[Figure: each output value is the dot product of a small (e.g., 3x3) weight matrix with one sub-region of the input; the window then shifts to the next position and the dot product is repeated.]
SGEMM performance on GPU
n  GPU is fast, if the matrix size is large enough
l  GPU is throughput-oriented processor
l  In case of small matrix, parallelisms and memory bandwidth are
not exploited efficiently
MCSoC15 Shinya T-Y, NAIST 8
[Chart: SGEMM performance on an NVIDIA Jetson TK1 (GK20A). X-axis: matrix size (64 to 4096); left Y-axis: Performance [GFLOPS]; right Y-axis: # active warps per active cycle for the small and large kernels.]
Preprocessing for Convolution on GPU
n  In order to use fast matrix multiplication library of GPU,
data duplication is usually utilized
l  Converting sub-regions into a single large matrix
n  Faster than the naive convolution, but still just a
performance overhead
MCSoC15 Shinya T-Y, NAIST 9
[Figure: each k x k (k=3) sub-region of the n x n input is duplicated into one row of a temporary (n-2)^2 x 9 (= k^2) array used for matrix multiplication.]
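A minimal C sketch of this duplication step (illustrative only; the function name and data layout are assumptions, not the authors' code):

/* im2col-style duplication: copy every k x k window of an n x n input
   into one row of a ((n-k+1)^2) x (k*k) temporary array. */
void duplicate_windows(float *tmp, const float *in, int n, int k)
{
    int out_side = n - k + 1;
    for (int y = 0; y < out_side; y++) {
        for (int x = 0; x < out_side; x++) {
            float *row = tmp + (y * out_side + x) * k * k;
            for (int ky = 0; ky < k; ky++)
                for (int kx = 0; kx < k; kx++)
                    row[ky * k + kx] = in[(y + ky) * n + (x + kx)];
        }
    }
}

Each input element is copied up to k^2 times, which is where the extra memory traffic behind the "still just a performance overhead" remark comes from.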
Our approach: EMAX
Energy-aware Multi-mode Accelerator eXtension
n  A CGRA of local memory based PEs with several buses
l  Each PE has a local memory for data locality
MCSoC15 Shinya T-Y, NAIST 10
[Block diagram: a CPU core and EMAX share an interconnection to DRAM; EMAX consists of a 2-D array of PEs attached to a memory interface.]
Real Chip of EMAX
n  12.5mm x 12.5mm in 180nm technology
MCSoC15 Shinya T-Y, NAIST 11
Processing Element (PE)
n  Local memory on each PE for efficient data locality and
memory bandwidth utilization
MCSoC15 Shinya T-Y, NAIST 12
[PE block diagram: two execution units (EX1, EX2), a local memory (LMM) with its FIFO and DIN/ADDR/DOUT ports, an EX FIFO, an EAG, and six constant registers; the PE connects to an internal shuffle bus, an external shuffle bus, and the memory bus.]
EMAX instruction
MCSoC15 Shinya T-Y, NAIST 19
(a) Instruction format
Type1: row#, col#, dist [count] ALU_OP & MEM_OP RGI LMM_CONTROL
Type2: row#, col#, dist [count] ALU_OP
Type3: row#, col#, dist [count] & MEM_OP RGI LMM_CONTROL
(b) EX1 operations
32-bit operation: add/add3/sub/sub3
16-bit x2 operation: mauh/mauh3/msuh3
Misc operation: mulh/mmrg3/msad/minl/minl3/mh2bw/mcas/mmid3/mmax/mmax3/mmin/mmin3
Load from EX_FIFO: ldb/ldub/ldh/lhuh/ld
Floating-point operation: fmul/fma3/fadd
(c) EX2 operations
32-bit operation: and/or/xor
16-bit x2 operation: mauh/mauh3/msuh3
(d) LMM operations
Load from LMM or LMM_FIFO: ldb/ldub/ldh/lhuh/ld
Store to LMM: stb/sth/st/cst
Forward propagation
n  Weight matrix is constant in the inter-most loop
l  Assigned into constant registers
n  Index of In increases linearly
l  Burst bulk transfer from the external memory
MCSoC15 Shinya T-Y, NAIST 20
Operations per activation of EMAX
Operations per clock cycle on EMAX
for(i1=0; i1<InDim; i1++){                        /* input feature maps */
  for(j1=0; j1<(Nimg-Nk+1); j1++){                /* output rows */
    for(i2=0; i2<OutDim; i2++){                   /* output feature maps */
      for(j2=0; j2<(Nimg-Nk+1)*(Nbatch); j2++){   /* output columns x batch */
        for(ky=0; ky<Nk; ky++){                   /* kernel rows */
          for(kx=0; kx<Nk; kx++){                 /* kernel columns */
            Out[i2][j1][j2] += Weight[i1][i2][ky][kx]*In[i1][j1+ky][j2+kx];
          }
        }
      }
} } }
InDim: Dimension of input data, OutDim: Dimension of output data
Nimg: Side length of input data, Nbatch: Batch size (= # of pixels)
Nk: Convolution window size
CNN on EMAX (3x3 convolution)
MCSoC15 Shinya T-Y, NAIST 23
[Mapping diagram: an 8-row x 4-column PE array. Weight values w[0..2][0..2] are assigned to constant registers and input data are stored in LMMs; PE rows form FMUL/FMA chains over in[i-1][], in[ i ][], and in[i+1][], followed by FADDs that accumulate into out[i][j] and a store of the current result to LMM. In parallel, the next inputs (in[i+2][] and out[i+1][]) are preloaded from memory and the previous result is drained to memory under the loop control and memory interface; the next input data set advances as a stencil.]
n  Key points of the mapping:
l  The 3x3 weight matrix is held in constant registers
l  Three input data sets (in[i-1][], in[ i ][], in[i+1][]) reside in LMMs, and the same read data is forwarded via FIFOs
l  Each stage reads from its constant register, its LMM, and the execution unit of the previous stage, and passes its result to the next stage
l  The final result is stored into the LMM of the next stage
l  The previous result is written back (drained) to main memory
l  The next input data is loaded (preloaded) from main memory
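The row-wise structure above can be read as the following minimal C sketch (illustrative only, assuming a 3x3 kernel; function and variable names are not from the slides): the nine weights stay in constant registers, the three input rows are streamed from LMMs, and the partial sum for each out[i][j] is accumulated across the three PE rows before being written back.

/* One output row of a 3x3 convolution, structured as in the EMAX mapping.
   Each input row must hold width+2 valid elements. */
void conv3x3_output_row(float *out_row, const float *in_prev, const float *in_cur,
                        const float *in_next, const float w[3][3], int width)
{
    for (int j = 0; j < width; j++) {
        float acc = out_row[j];   /* partial result preloaded from LMM (LMM LD out[i][j]) */
        acc += w[0][0]*in_prev[j] + w[0][1]*in_prev[j+1] + w[0][2]*in_prev[j+2];  /* PE row on in[i-1][] */
        acc += w[1][0]*in_cur[j]  + w[1][1]*in_cur[j+1]  + w[1][2]*in_cur[j+2];   /* PE row on in[ i ][] */
        acc += w[2][0]*in_next[j] + w[2][1]*in_next[j+1] + w[2][2]*in_next[j+2];  /* PE row on in[i+1][] */
        out_row[j] = acc;         /* stored to LMM, later drained to main memory */
    }
}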
Evaluation setup
n  Benchmark: deep learning datasets and networks
l  Imagenet (Alexnet-2), CIFAR10, MNIST (Lenet)
n  Hardware:
l  CPU (Corei7, ARM), GPU (Desktop, Mobile), EMAX
l  Metric: Performance per bandwidth, Performance per area
•  Estimation from actual LSI of EMAX and software simulations
MCSoC15 Shinya T-Y, NAIST 30
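For reference, the performance-per-bandwidth metric in the following slides is presumably arithmetic intensity: Operations/Byte = (floating-point operations executed) / (bytes transferred to and from external memory). A higher value means more useful work is extracted per byte of the limited embedded memory bandwidth.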
Performance per memory bandwidth
n  EMAX achieves better performance in embedded class
datasets
MCSoC15 Shinya T-Y, NAIST 31
[Chart: Operations/Byte for Alexnet-2, CIFAR10-1/2/3 (and average), and Lenet-1/2 (and average), comparing EMAX, GTX980, GK20A, Core i7, and ARM.]
l  Alexnet: since the matrix size is large, the desktop GPU is 3.17x better
l  CIFAR-10: EMAX is 1.41x better than the mobile GPU
l  Lenet: EMAX is 1.75x better than the mobile GPU
Performance per area
n  EMAX achieves much better performance in embedded
class datasets: CGRA is better for embedded systems?
MCSoC15 Shinya T-Y, NAIST 34
[Chart: AreaPerf [FLOPS/Tr] for Alexnet-2, CIFAR10-1/2/3 (and average), and Lenet-1/2 (and average), comparing EMAX, GTX980, and Core i7.]
l  Alexnet: since the matrix size is large, the desktop GPU is 2.2x better
l  CIFAR-10: EMAX is 1.76x better than the mobile GPU
l  Lenet: EMAX is 1.95x better than the mobile GPU
Conclusion
n  A CGRA-based acceleration approach of convolutional
neural network (CNN) for embedded accelerators
l  EMAX (Energy-aware Multi-mode Accelerator eXtension)
n  EMAX outperforms GPU in embedded class data sets
l  1.75x better performance per memory bandwidth
l  1.95x better performance per area ( energy)
MCSoC15 Shinya T-Y, NAIST 37
Interconnection
DRAM
CPU
Core
PE PE PE PE
MemoryInterface
EMAX
PE PE PE PE
PE PE PE PE

More Related Content

What's hot

Nvidia Corporate Presentation
Nvidia Corporate PresentationNvidia Corporate Presentation
Nvidia Corporate PresentationShanker Trivedi
 
“Introducing the Kria Robotics Starter Kit: Robotics and Machine Vision for S...
“Introducing the Kria Robotics Starter Kit: Robotics and Machine Vision for S...“Introducing the Kria Robotics Starter Kit: Robotics and Machine Vision for S...
“Introducing the Kria Robotics Starter Kit: Robotics and Machine Vision for S...Edge AI and Vision Alliance
 
Project meeting: Android Graphics Architecture Overview
Project meeting: Android Graphics Architecture OverviewProject meeting: Android Graphics Architecture Overview
Project meeting: Android Graphics Architecture OverviewYu-Hsin Hung
 
Bindless Deferred Decals in The Surge 2
Bindless Deferred Decals in The Surge 2Bindless Deferred Decals in The Surge 2
Bindless Deferred Decals in The Surge 2Philip Hammer
 
Crysis 2-key-rendering-features
Crysis 2-key-rendering-featuresCrysis 2-key-rendering-features
Crysis 2-key-rendering-featuresRaimundo Renato
 
OpenJDK トラブルシューティング #javacasual
OpenJDK トラブルシューティング #javacasualOpenJDK トラブルシューティング #javacasual
OpenJDK トラブルシューティング #javacasualYuji Kubota
 
Android for Embedded Linux Developers
Android for Embedded Linux DevelopersAndroid for Embedded Linux Developers
Android for Embedded Linux DevelopersOpersys inc.
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLinside-BigData.com
 
BLS署名の実装とその応用
BLS署名の実装とその応用BLS署名の実装とその応用
BLS署名の実装とその応用MITSUNARI Shigeo
 
AMD Chiplet Architecture for High-Performance Server and Desktop Products
AMD Chiplet Architecture for High-Performance Server and Desktop ProductsAMD Chiplet Architecture for High-Performance Server and Desktop Products
AMD Chiplet Architecture for High-Performance Server and Desktop ProductsAMD
 
Native Android Userspace part of the Embedded Android Workshop at Linaro Conn...
Native Android Userspace part of the Embedded Android Workshop at Linaro Conn...Native Android Userspace part of the Embedded Android Workshop at Linaro Conn...
Native Android Userspace part of the Embedded Android Workshop at Linaro Conn...Opersys inc.
 
BUD17-400: Secure Data Path with OPTEE
BUD17-400: Secure Data Path with OPTEE BUD17-400: Secure Data Path with OPTEE
BUD17-400: Secure Data Path with OPTEE Linaro
 
Windows 開発者のための Dev&Ops on AWS
Windows 開発者のための Dev&Ops on AWSWindows 開発者のための Dev&Ops on AWS
Windows 開発者のための Dev&Ops on AWSAmazon Web Services Japan
 
“Tools for Creating Next-Gen Computer Vision Apps on Snapdragon,” a Presentat...
“Tools for Creating Next-Gen Computer Vision Apps on Snapdragon,” a Presentat...“Tools for Creating Next-Gen Computer Vision Apps on Snapdragon,” a Presentat...
“Tools for Creating Next-Gen Computer Vision Apps on Snapdragon,” a Presentat...Edge AI and Vision Alliance
 
Intro to SVE 富岳のA64FXを触ってみた
Intro to SVE 富岳のA64FXを触ってみたIntro to SVE 富岳のA64FXを触ってみた
Intro to SVE 富岳のA64FXを触ってみたMITSUNARI Shigeo
 

What's hot (20)

Supercell
SupercellSupercell
Supercell
 
Nvidia Corporate Presentation
Nvidia Corporate PresentationNvidia Corporate Presentation
Nvidia Corporate Presentation
 
“Introducing the Kria Robotics Starter Kit: Robotics and Machine Vision for S...
“Introducing the Kria Robotics Starter Kit: Robotics and Machine Vision for S...“Introducing the Kria Robotics Starter Kit: Robotics and Machine Vision for S...
“Introducing the Kria Robotics Starter Kit: Robotics and Machine Vision for S...
 
Project meeting: Android Graphics Architecture Overview
Project meeting: Android Graphics Architecture OverviewProject meeting: Android Graphics Architecture Overview
Project meeting: Android Graphics Architecture Overview
 
Bindless Deferred Decals in The Surge 2
Bindless Deferred Decals in The Surge 2Bindless Deferred Decals in The Surge 2
Bindless Deferred Decals in The Surge 2
 
Qemu JIT Code Generator and System Emulation
Qemu JIT Code Generator and System EmulationQemu JIT Code Generator and System Emulation
Qemu JIT Code Generator and System Emulation
 
Emotiv epoc introduction
Emotiv epoc introductionEmotiv epoc introduction
Emotiv epoc introduction
 
Crysis 2-key-rendering-features
Crysis 2-key-rendering-featuresCrysis 2-key-rendering-features
Crysis 2-key-rendering-features
 
OpenJDK トラブルシューティング #javacasual
OpenJDK トラブルシューティング #javacasualOpenJDK トラブルシューティング #javacasual
OpenJDK トラブルシューティング #javacasual
 
Android for Embedded Linux Developers
Android for Embedded Linux DevelopersAndroid for Embedded Linux Developers
Android for Embedded Linux Developers
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and ML
 
Hardware Accelerated 2D Rendering for Android
Hardware Accelerated 2D Rendering for AndroidHardware Accelerated 2D Rendering for Android
Hardware Accelerated 2D Rendering for Android
 
BLS署名の実装とその応用
BLS署名の実装とその応用BLS署名の実装とその応用
BLS署名の実装とその応用
 
AMD Chiplet Architecture for High-Performance Server and Desktop Products
AMD Chiplet Architecture for High-Performance Server and Desktop ProductsAMD Chiplet Architecture for High-Performance Server and Desktop Products
AMD Chiplet Architecture for High-Performance Server and Desktop Products
 
Native Android Userspace part of the Embedded Android Workshop at Linaro Conn...
Native Android Userspace part of the Embedded Android Workshop at Linaro Conn...Native Android Userspace part of the Embedded Android Workshop at Linaro Conn...
Native Android Userspace part of the Embedded Android Workshop at Linaro Conn...
 
BUD17-400: Secure Data Path with OPTEE
BUD17-400: Secure Data Path with OPTEE BUD17-400: Secure Data Path with OPTEE
BUD17-400: Secure Data Path with OPTEE
 
Windows 開発者のための Dev&Ops on AWS
Windows 開発者のための Dev&Ops on AWSWindows 開発者のための Dev&Ops on AWS
Windows 開発者のための Dev&Ops on AWS
 
“Tools for Creating Next-Gen Computer Vision Apps on Snapdragon,” a Presentat...
“Tools for Creating Next-Gen Computer Vision Apps on Snapdragon,” a Presentat...“Tools for Creating Next-Gen Computer Vision Apps on Snapdragon,” a Presentat...
“Tools for Creating Next-Gen Computer Vision Apps on Snapdragon,” a Presentat...
 
Hair in Tomb Raider
Hair in Tomb RaiderHair in Tomb Raider
Hair in Tomb Raider
 
Intro to SVE 富岳のA64FXを触ってみた
Intro to SVE 富岳のA64FXを触ってみたIntro to SVE 富岳のA64FXを触ってみた
Intro to SVE 富岳のA64FXを触ってみた
 

Viewers also liked

PyCoRAM (高位合成友の会@ドワンゴ, 2015年1月16日)
PyCoRAM (高位合成友の会@ドワンゴ, 2015年1月16日)PyCoRAM (高位合成友の会@ドワンゴ, 2015年1月16日)
PyCoRAM (高位合成友の会@ドワンゴ, 2015年1月16日)Shinya Takamaeda-Y
 
An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011
An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011
An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011Shinya Takamaeda-Y
 
PyCoRAMによるPythonを用いたポータブルなFPGAアクセラレータ開発 (チュートリアル@ESS2014)
PyCoRAMによるPythonを用いたポータブルなFPGAアクセラレータ開発 (チュートリアル@ESS2014)PyCoRAMによるPythonを用いたポータブルなFPGAアクセラレータ開発 (チュートリアル@ESS2014)
PyCoRAMによるPythonを用いたポータブルなFPGAアクセラレータ開発 (チュートリアル@ESS2014)Shinya Takamaeda-Y
 
助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」
助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」
助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」Shinya Takamaeda-Y
 
Pythonによる高位設計フレームワークPyCoRAMでFPGAシステムを開発してみよう
Pythonによる高位設計フレームワークPyCoRAMでFPGAシステムを開発してみようPythonによる高位設計フレームワークPyCoRAMでFPGAシステムを開発してみよう
Pythonによる高位設計フレームワークPyCoRAMでFPGAシステムを開発してみようShinya Takamaeda-Y
 
Mapping Applications with Collectives over Sub-communicators on Torus Network...
Mapping Applications with Collectives over Sub-communicators on Torus Network...Mapping Applications with Collectives over Sub-communicators on Torus Network...
Mapping Applications with Collectives over Sub-communicators on Torus Network...Shinya Takamaeda-Y
 
Veriloggen: Pythonによるハードウェアメタプログラミング(第3回 高位合成友の会 @ドワンゴ)
Veriloggen: Pythonによるハードウェアメタプログラミング(第3回 高位合成友の会 @ドワンゴ)Veriloggen: Pythonによるハードウェアメタプログラミング(第3回 高位合成友の会 @ドワンゴ)
Veriloggen: Pythonによるハードウェアメタプログラミング(第3回 高位合成友の会 @ドワンゴ)Shinya Takamaeda-Y
 
マルチパラダイム型高水準ハードウェア設計環境の検討
マルチパラダイム型高水準ハードウェア設計環境の検討マルチパラダイム型高水準ハードウェア設計環境の検討
マルチパラダイム型高水準ハードウェア設計環境の検討Shinya Takamaeda-Y
 
PythonとPyCoRAMでお手軽にFPGAシステムを開発してみよう
PythonとPyCoRAMでお手軽にFPGAシステムを開発してみようPythonとPyCoRAMでお手軽にFPGAシステムを開発してみよう
PythonとPyCoRAMでお手軽にFPGAシステムを開発してみようShinya Takamaeda-Y
 
PyCoRAMを用いたグラフ処理FPGAアクセラレータ
PyCoRAMを用いたグラフ処理FPGAアクセラレータPyCoRAMを用いたグラフ処理FPGAアクセラレータ
PyCoRAMを用いたグラフ処理FPGAアクセラレータShinya Takamaeda-Y
 
Pythonを用いた高水準ハードウェア設計環境の検討
Pythonを用いた高水準ハードウェア設計環境の検討Pythonを用いた高水準ハードウェア設計環境の検討
Pythonを用いた高水準ハードウェア設計環境の検討Shinya Takamaeda-Y
 
PythonとVeriloggenを用いたRTL設計メタプログラミング
PythonとVeriloggenを用いたRTL設計メタプログラミングPythonとVeriloggenを用いたRTL設計メタプログラミング
PythonとVeriloggenを用いたRTL設計メタプログラミングShinya Takamaeda-Y
 
コンピュータアーキテクチャ研究の最新動向〜ISCA2015参加報告〜 @FPGAエクストリーム・コンピューティング 第7回 (#fpgax #7)
コンピュータアーキテクチャ研究の最新動向〜ISCA2015参加報告〜 @FPGAエクストリーム・コンピューティング 第7回 (#fpgax #7)コンピュータアーキテクチャ研究の最新動向〜ISCA2015参加報告〜 @FPGAエクストリーム・コンピューティング 第7回 (#fpgax #7)
コンピュータアーキテクチャ研究の最新動向〜ISCA2015参加報告〜 @FPGAエクストリーム・コンピューティング 第7回 (#fpgax #7)Shinya Takamaeda-Y
 
Zynq + Vivado HLS入門
Zynq + Vivado HLS入門Zynq + Vivado HLS入門
Zynq + Vivado HLS入門narusugimoto
 
Debian Linux on Zynq (Xilinx ARM-SoC FPGA) Setup Flow (Vivado 2015.4)
Debian Linux on Zynq (Xilinx ARM-SoC FPGA) Setup Flow (Vivado 2015.4)Debian Linux on Zynq (Xilinx ARM-SoC FPGA) Setup Flow (Vivado 2015.4)
Debian Linux on Zynq (Xilinx ARM-SoC FPGA) Setup Flow (Vivado 2015.4)Shinya Takamaeda-Y
 
FPGA・リコンフィギャラブルシステム研究の最新動向
FPGA・リコンフィギャラブルシステム研究の最新動向FPGA・リコンフィギャラブルシステム研究の最新動向
FPGA・リコンフィギャラブルシステム研究の最新動向Shinya Takamaeda-Y
 

Viewers also liked (17)

PyCoRAM (高位合成友の会@ドワンゴ, 2015年1月16日)
PyCoRAM (高位合成友の会@ドワンゴ, 2015年1月16日)PyCoRAM (高位合成友の会@ドワンゴ, 2015年1月16日)
PyCoRAM (高位合成友の会@ドワンゴ, 2015年1月16日)
 
An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011
An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011
An FPGA-based Scalable Simulation Accelerator for Tile Architectures @HEART2011
 
PyCoRAMによるPythonを用いたポータブルなFPGAアクセラレータ開発 (チュートリアル@ESS2014)
PyCoRAMによるPythonを用いたポータブルなFPGAアクセラレータ開発 (チュートリアル@ESS2014)PyCoRAMによるPythonを用いたポータブルなFPGAアクセラレータ開発 (チュートリアル@ESS2014)
PyCoRAMによるPythonを用いたポータブルなFPGAアクセラレータ開発 (チュートリアル@ESS2014)
 
助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」
助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」
助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」
 
Pythonによる高位設計フレームワークPyCoRAMでFPGAシステムを開発してみよう
Pythonによる高位設計フレームワークPyCoRAMでFPGAシステムを開発してみようPythonによる高位設計フレームワークPyCoRAMでFPGAシステムを開発してみよう
Pythonによる高位設計フレームワークPyCoRAMでFPGAシステムを開発してみよう
 
Mapping Applications with Collectives over Sub-communicators on Torus Network...
Mapping Applications with Collectives over Sub-communicators on Torus Network...Mapping Applications with Collectives over Sub-communicators on Torus Network...
Mapping Applications with Collectives over Sub-communicators on Torus Network...
 
Veriloggen: Pythonによるハードウェアメタプログラミング(第3回 高位合成友の会 @ドワンゴ)
Veriloggen: Pythonによるハードウェアメタプログラミング(第3回 高位合成友の会 @ドワンゴ)Veriloggen: Pythonによるハードウェアメタプログラミング(第3回 高位合成友の会 @ドワンゴ)
Veriloggen: Pythonによるハードウェアメタプログラミング(第3回 高位合成友の会 @ドワンゴ)
 
マルチパラダイム型高水準ハードウェア設計環境の検討
マルチパラダイム型高水準ハードウェア設計環境の検討マルチパラダイム型高水準ハードウェア設計環境の検討
マルチパラダイム型高水準ハードウェア設計環境の検討
 
PythonとPyCoRAMでお手軽にFPGAシステムを開発してみよう
PythonとPyCoRAMでお手軽にFPGAシステムを開発してみようPythonとPyCoRAMでお手軽にFPGAシステムを開発してみよう
PythonとPyCoRAMでお手軽にFPGAシステムを開発してみよう
 
PyCoRAMを用いたグラフ処理FPGAアクセラレータ
PyCoRAMを用いたグラフ処理FPGAアクセラレータPyCoRAMを用いたグラフ処理FPGAアクセラレータ
PyCoRAMを用いたグラフ処理FPGAアクセラレータ
 
Pythonを用いた高水準ハードウェア設計環境の検討
Pythonを用いた高水準ハードウェア設計環境の検討Pythonを用いた高水準ハードウェア設計環境の検討
Pythonを用いた高水準ハードウェア設計環境の検討
 
PythonとVeriloggenを用いたRTL設計メタプログラミング
PythonとVeriloggenを用いたRTL設計メタプログラミングPythonとVeriloggenを用いたRTL設計メタプログラミング
PythonとVeriloggenを用いたRTL設計メタプログラミング
 
コンピュータアーキテクチャ研究の最新動向〜ISCA2015参加報告〜 @FPGAエクストリーム・コンピューティング 第7回 (#fpgax #7)
コンピュータアーキテクチャ研究の最新動向〜ISCA2015参加報告〜 @FPGAエクストリーム・コンピューティング 第7回 (#fpgax #7)コンピュータアーキテクチャ研究の最新動向〜ISCA2015参加報告〜 @FPGAエクストリーム・コンピューティング 第7回 (#fpgax #7)
コンピュータアーキテクチャ研究の最新動向〜ISCA2015参加報告〜 @FPGAエクストリーム・コンピューティング 第7回 (#fpgax #7)
 
Zynq+PyCoRAM(+Debian)入門
Zynq+PyCoRAM(+Debian)入門Zynq+PyCoRAM(+Debian)入門
Zynq+PyCoRAM(+Debian)入門
 
Zynq + Vivado HLS入門
Zynq + Vivado HLS入門Zynq + Vivado HLS入門
Zynq + Vivado HLS入門
 
Debian Linux on Zynq (Xilinx ARM-SoC FPGA) Setup Flow (Vivado 2015.4)
Debian Linux on Zynq (Xilinx ARM-SoC FPGA) Setup Flow (Vivado 2015.4)Debian Linux on Zynq (Xilinx ARM-SoC FPGA) Setup Flow (Vivado 2015.4)
Debian Linux on Zynq (Xilinx ARM-SoC FPGA) Setup Flow (Vivado 2015.4)
 
FPGA・リコンフィギャラブルシステム研究の最新動向
FPGA・リコンフィギャラブルシステム研究の最新動向FPGA・リコンフィギャラブルシステム研究の最新動向
FPGA・リコンフィギャラブルシステム研究の最新動向
 

Similar to A CGRA-based Approach for Accelerating Convolutional Neural Networks

Hardware Acceleration for Machine Learning
Hardware Acceleration for Machine LearningHardware Acceleration for Machine Learning
Hardware Acceleration for Machine LearningCastLabKAIST
 
Parallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAParallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAprithan
 
Intro to Machine Learning for GPUs
Intro to Machine Learning for GPUsIntro to Machine Learning for GPUs
Intro to Machine Learning for GPUsSri Ambati
 
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...NVIDIA Taiwan
 
Neural network basic and introduction of Deep learning
Neural network basic and introduction of Deep learningNeural network basic and introduction of Deep learning
Neural network basic and introduction of Deep learningTapas Majumdar
 
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic...
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic...Quantization and Training of Neural Networks for Efficient Integer-Arithmetic...
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic...Ryo Takahashi
 
Accelerating microbiome research with OpenACC
Accelerating microbiome research with OpenACCAccelerating microbiome research with OpenACC
Accelerating microbiome research with OpenACCIgor Sfiligoi
 
High Performance Pedestrian Detection On TEGRA X1
High Performance Pedestrian Detection On TEGRA X1High Performance Pedestrian Detection On TEGRA X1
High Performance Pedestrian Detection On TEGRA X1NVIDIA
 
B Eng Final Year Project Presentation
B Eng Final Year Project PresentationB Eng Final Year Project Presentation
B Eng Final Year Project Presentationjesujoseph
 
Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor,...
Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor,...Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor,...
Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor,...MLconf
 
Efficient Implementation of Low Power 2-D DCT Architecture
Efficient Implementation of Low Power 2-D DCT ArchitectureEfficient Implementation of Low Power 2-D DCT Architecture
Efficient Implementation of Low Power 2-D DCT ArchitectureIJMER
 
GPGPU Computation
GPGPU ComputationGPGPU Computation
GPGPU Computationjtsagata
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...ijceronline
 
Tutorial-on-DNN-09A-Co-design-Sparsity.pdf
Tutorial-on-DNN-09A-Co-design-Sparsity.pdfTutorial-on-DNN-09A-Co-design-Sparsity.pdf
Tutorial-on-DNN-09A-Co-design-Sparsity.pdfDuy-Hieu Bui
 
Multilayer Neuronal network hardware implementation
Multilayer Neuronal network hardware implementation Multilayer Neuronal network hardware implementation
Multilayer Neuronal network hardware implementation Nabil Chouba
 
Lrz kurs: gpu and mic programming with r
Lrz kurs: gpu and mic programming with rLrz kurs: gpu and mic programming with r
Lrz kurs: gpu and mic programming with rFerdinand Jamitzky
 
Veriloggen.Thread & Stream: 最高性能FPGAコンピューティングを 目指したミックスドパラダイム型高位合成 (FPGAX 201...
Veriloggen.Thread & Stream: 最高性能FPGAコンピューティングを 目指したミックスドパラダイム型高位合成 (FPGAX 201...Veriloggen.Thread & Stream: 最高性能FPGAコンピューティングを 目指したミックスドパラダイム型高位合成 (FPGAX 201...
Veriloggen.Thread & Stream: 最高性能FPGAコンピューティングを 目指したミックスドパラダイム型高位合成 (FPGAX 201...Shinya Takamaeda-Y
 
Klessydra t - designing vector coprocessors for multi-threaded edge-computing...
Klessydra t - designing vector coprocessors for multi-threaded edge-computing...Klessydra t - designing vector coprocessors for multi-threaded edge-computing...
Klessydra t - designing vector coprocessors for multi-threaded edge-computing...RISC-V International
 

Similar to A CGRA-based Approach for Accelerating Convolutional Neural Networks (20)

Hardware Acceleration for Machine Learning
Hardware Acceleration for Machine LearningHardware Acceleration for Machine Learning
Hardware Acceleration for Machine Learning
 
Parallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAParallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDA
 
Intro to Machine Learning for GPUs
Intro to Machine Learning for GPUsIntro to Machine Learning for GPUs
Intro to Machine Learning for GPUs
 
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
 
Neural network basic and introduction of Deep learning
Neural network basic and introduction of Deep learningNeural network basic and introduction of Deep learning
Neural network basic and introduction of Deep learning
 
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic...
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic...Quantization and Training of Neural Networks for Efficient Integer-Arithmetic...
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic...
 
Accelerating microbiome research with OpenACC
Accelerating microbiome research with OpenACCAccelerating microbiome research with OpenACC
Accelerating microbiome research with OpenACC
 
High Performance Pedestrian Detection On TEGRA X1
High Performance Pedestrian Detection On TEGRA X1High Performance Pedestrian Detection On TEGRA X1
High Performance Pedestrian Detection On TEGRA X1
 
B Eng Final Year Project Presentation
B Eng Final Year Project PresentationB Eng Final Year Project Presentation
B Eng Final Year Project Presentation
 
B.tech_project_ppt.pptx
B.tech_project_ppt.pptxB.tech_project_ppt.pptx
B.tech_project_ppt.pptx
 
Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor,...
Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor,...Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor,...
Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor,...
 
Efficient Implementation of Low Power 2-D DCT Architecture
Efficient Implementation of Low Power 2-D DCT ArchitectureEfficient Implementation of Low Power 2-D DCT Architecture
Efficient Implementation of Low Power 2-D DCT Architecture
 
Gpu perf-presentation
Gpu perf-presentationGpu perf-presentation
Gpu perf-presentation
 
GPGPU Computation
GPGPU ComputationGPGPU Computation
GPGPU Computation
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
 
Tutorial-on-DNN-09A-Co-design-Sparsity.pdf
Tutorial-on-DNN-09A-Co-design-Sparsity.pdfTutorial-on-DNN-09A-Co-design-Sparsity.pdf
Tutorial-on-DNN-09A-Co-design-Sparsity.pdf
 
Multilayer Neuronal network hardware implementation
Multilayer Neuronal network hardware implementation Multilayer Neuronal network hardware implementation
Multilayer Neuronal network hardware implementation
 
Lrz kurs: gpu and mic programming with r
Lrz kurs: gpu and mic programming with rLrz kurs: gpu and mic programming with r
Lrz kurs: gpu and mic programming with r
 
Veriloggen.Thread & Stream: 最高性能FPGAコンピューティングを 目指したミックスドパラダイム型高位合成 (FPGAX 201...
Veriloggen.Thread & Stream: 最高性能FPGAコンピューティングを 目指したミックスドパラダイム型高位合成 (FPGAX 201...Veriloggen.Thread & Stream: 最高性能FPGAコンピューティングを 目指したミックスドパラダイム型高位合成 (FPGAX 201...
Veriloggen.Thread & Stream: 最高性能FPGAコンピューティングを 目指したミックスドパラダイム型高位合成 (FPGAX 201...
 
Klessydra t - designing vector coprocessors for multi-threaded edge-computing...
Klessydra t - designing vector coprocessors for multi-threaded edge-computing...Klessydra t - designing vector coprocessors for multi-threaded edge-computing...
Klessydra t - designing vector coprocessors for multi-threaded edge-computing...
 

More from Shinya Takamaeda-Y

オープンソースコンパイラNNgenでつくるエッジ・ディープラーニングシステム
オープンソースコンパイラNNgenでつくるエッジ・ディープラーニングシステムオープンソースコンパイラNNgenでつくるエッジ・ディープラーニングシステム
オープンソースコンパイラNNgenでつくるエッジ・ディープラーニングシステムShinya Takamaeda-Y
 
DNNのモデル特化ハードウェアを生成するオープンソースコンパイラNNgenのデモ
DNNのモデル特化ハードウェアを生成するオープンソースコンパイラNNgenのデモDNNのモデル特化ハードウェアを生成するオープンソースコンパイラNNgenのデモ
DNNのモデル特化ハードウェアを生成するオープンソースコンパイラNNgenのデモShinya Takamaeda-Y
 
ディープニューラルネットワーク向け拡張可能な高位合成コンパイラの開発
ディープニューラルネットワーク向け拡張可能な高位合成コンパイラの開発ディープニューラルネットワーク向け拡張可能な高位合成コンパイラの開発
ディープニューラルネットワーク向け拡張可能な高位合成コンパイラの開発Shinya Takamaeda-Y
 
Veriloggen.Stream: データフローからハードウェアを作る(2018年3月3日 高位合成友の会 第5回 @東京工業大学)
Veriloggen.Stream: データフローからハードウェアを作る(2018年3月3日 高位合成友の会 第5回 @東京工業大学)Veriloggen.Stream: データフローからハードウェアを作る(2018年3月3日 高位合成友の会 第5回 @東京工業大学)
Veriloggen.Stream: データフローからハードウェアを作る(2018年3月3日 高位合成友の会 第5回 @東京工業大学)Shinya Takamaeda-Y
 
Pythonによるカスタム可能な高位設計技術 (Design Solution Forum 2016@新横浜)
Pythonによるカスタム可能な高位設計技術 (Design Solution Forum 2016@新横浜)Pythonによるカスタム可能な高位設計技術 (Design Solution Forum 2016@新横浜)
Pythonによるカスタム可能な高位設計技術 (Design Solution Forum 2016@新横浜)Shinya Takamaeda-Y
 
ゆるふわコンピュータ (IPSJ-ONE2017)
ゆるふわコンピュータ (IPSJ-ONE2017)ゆるふわコンピュータ (IPSJ-ONE2017)
ゆるふわコンピュータ (IPSJ-ONE2017)Shinya Takamaeda-Y
 
A Framework for Efficient Rapid Prototyping by Virtually Enlarging FPGA Resou...
A Framework for Efficient Rapid Prototyping by Virtually Enlarging FPGA Resou...A Framework for Efficient Rapid Prototyping by Virtually Enlarging FPGA Resou...
A Framework for Efficient Rapid Prototyping by Virtually Enlarging FPGA Resou...Shinya Takamaeda-Y
 
A High Performance Heterogeneous FPGA-based Accelerator with PyCoRAM (Runner ...
A High Performance Heterogeneous FPGA-based Accelerator with PyCoRAM (Runner ...A High Performance Heterogeneous FPGA-based Accelerator with PyCoRAM (Runner ...
A High Performance Heterogeneous FPGA-based Accelerator with PyCoRAM (Runner ...Shinya Takamaeda-Y
 
PyCoRAM: Python-Verilog高位合成とメモリ抽象化によるFPGAアクセラレータ向けIPコア開発フレームワーク (FPGAX #05)
PyCoRAM: Python-Verilog高位合成とメモリ抽象化によるFPGAアクセラレータ向けIPコア開発フレームワーク (FPGAX #05)PyCoRAM: Python-Verilog高位合成とメモリ抽象化によるFPGAアクセラレータ向けIPコア開発フレームワーク (FPGAX #05)
PyCoRAM: Python-Verilog高位合成とメモリ抽象化によるFPGAアクセラレータ向けIPコア開発フレームワーク (FPGAX #05)Shinya Takamaeda-Y
 
メモリ抽象化フレームワークPyCoRAMを用いたソフトプロセッサ混載FPGAアクセラレータの開発
メモリ抽象化フレームワークPyCoRAMを用いたソフトプロセッサ混載FPGAアクセラレータの開発メモリ抽象化フレームワークPyCoRAMを用いたソフトプロセッサ混載FPGAアクセラレータの開発
メモリ抽象化フレームワークPyCoRAMを用いたソフトプロセッサ混載FPGAアクセラレータの開発Shinya Takamaeda-Y
 
PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern F...
PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern F...PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern F...
PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern F...Shinya Takamaeda-Y
 
むかし名言集bot作りました!
むかし名言集bot作りました!むかし名言集bot作りました!
むかし名言集bot作りました!Shinya Takamaeda-Y
 
APGAS言語X10を用いたオンチップネットワークシミュレーションの並列化
APGAS言語X10を用いたオンチップネットワークシミュレーションの並列化APGAS言語X10を用いたオンチップネットワークシミュレーションの並列化
APGAS言語X10を用いたオンチップネットワークシミュレーションの並列化Shinya Takamaeda-Y
 
Network Performance of Multifunction On-chip Router Architectures (IEICE-CPSY...
Network Performance of Multifunction On-chip Router Architectures (IEICE-CPSY...Network Performance of Multifunction On-chip Router Architectures (IEICE-CPSY...
Network Performance of Multifunction On-chip Router Architectures (IEICE-CPSY...Shinya Takamaeda-Y
 

More from Shinya Takamaeda-Y (14)

オープンソースコンパイラNNgenでつくるエッジ・ディープラーニングシステム
オープンソースコンパイラNNgenでつくるエッジ・ディープラーニングシステムオープンソースコンパイラNNgenでつくるエッジ・ディープラーニングシステム
オープンソースコンパイラNNgenでつくるエッジ・ディープラーニングシステム
 
DNNのモデル特化ハードウェアを生成するオープンソースコンパイラNNgenのデモ
DNNのモデル特化ハードウェアを生成するオープンソースコンパイラNNgenのデモDNNのモデル特化ハードウェアを生成するオープンソースコンパイラNNgenのデモ
DNNのモデル特化ハードウェアを生成するオープンソースコンパイラNNgenのデモ
 
ディープニューラルネットワーク向け拡張可能な高位合成コンパイラの開発
ディープニューラルネットワーク向け拡張可能な高位合成コンパイラの開発ディープニューラルネットワーク向け拡張可能な高位合成コンパイラの開発
ディープニューラルネットワーク向け拡張可能な高位合成コンパイラの開発
 
Veriloggen.Stream: データフローからハードウェアを作る(2018年3月3日 高位合成友の会 第5回 @東京工業大学)
Veriloggen.Stream: データフローからハードウェアを作る(2018年3月3日 高位合成友の会 第5回 @東京工業大学)Veriloggen.Stream: データフローからハードウェアを作る(2018年3月3日 高位合成友の会 第5回 @東京工業大学)
Veriloggen.Stream: データフローからハードウェアを作る(2018年3月3日 高位合成友の会 第5回 @東京工業大学)
 
Pythonによるカスタム可能な高位設計技術 (Design Solution Forum 2016@新横浜)
Pythonによるカスタム可能な高位設計技術 (Design Solution Forum 2016@新横浜)Pythonによるカスタム可能な高位設計技術 (Design Solution Forum 2016@新横浜)
Pythonによるカスタム可能な高位設計技術 (Design Solution Forum 2016@新横浜)
 
ゆるふわコンピュータ (IPSJ-ONE2017)
ゆるふわコンピュータ (IPSJ-ONE2017)ゆるふわコンピュータ (IPSJ-ONE2017)
ゆるふわコンピュータ (IPSJ-ONE2017)
 
A Framework for Efficient Rapid Prototyping by Virtually Enlarging FPGA Resou...
A Framework for Efficient Rapid Prototyping by Virtually Enlarging FPGA Resou...A Framework for Efficient Rapid Prototyping by Virtually Enlarging FPGA Resou...
A Framework for Efficient Rapid Prototyping by Virtually Enlarging FPGA Resou...
 
A High Performance Heterogeneous FPGA-based Accelerator with PyCoRAM (Runner ...
A High Performance Heterogeneous FPGA-based Accelerator with PyCoRAM (Runner ...A High Performance Heterogeneous FPGA-based Accelerator with PyCoRAM (Runner ...
A High Performance Heterogeneous FPGA-based Accelerator with PyCoRAM (Runner ...
 
PyCoRAM: Python-Verilog高位合成とメモリ抽象化によるFPGAアクセラレータ向けIPコア開発フレームワーク (FPGAX #05)
PyCoRAM: Python-Verilog高位合成とメモリ抽象化によるFPGAアクセラレータ向けIPコア開発フレームワーク (FPGAX #05)PyCoRAM: Python-Verilog高位合成とメモリ抽象化によるFPGAアクセラレータ向けIPコア開発フレームワーク (FPGAX #05)
PyCoRAM: Python-Verilog高位合成とメモリ抽象化によるFPGAアクセラレータ向けIPコア開発フレームワーク (FPGAX #05)
 
メモリ抽象化フレームワークPyCoRAMを用いたソフトプロセッサ混載FPGAアクセラレータの開発
メモリ抽象化フレームワークPyCoRAMを用いたソフトプロセッサ混載FPGAアクセラレータの開発メモリ抽象化フレームワークPyCoRAMを用いたソフトプロセッサ混載FPGAアクセラレータの開発
メモリ抽象化フレームワークPyCoRAMを用いたソフトプロセッサ混載FPGAアクセラレータの開発
 
PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern F...
PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern F...PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern F...
PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern F...
 
むかし名言集bot作りました!
むかし名言集bot作りました!むかし名言集bot作りました!
むかし名言集bot作りました!
 
APGAS言語X10を用いたオンチップネットワークシミュレーションの並列化
APGAS言語X10を用いたオンチップネットワークシミュレーションの並列化APGAS言語X10を用いたオンチップネットワークシミュレーションの並列化
APGAS言語X10を用いたオンチップネットワークシミュレーションの並列化
 
Network Performance of Multifunction On-chip Router Architectures (IEICE-CPSY...
Network Performance of Multifunction On-chip Router Architectures (IEICE-CPSY...Network Performance of Multifunction On-chip Router Architectures (IEICE-CPSY...
Network Performance of Multifunction On-chip Router Architectures (IEICE-CPSY...
 

Recently uploaded

ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 

Recently uploaded (20)

ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 

A CGRA-based Approach for Accelerating Convolutional Neural Networks

[Figure: SGEMM performance on an NVIDIA Jetson TK1 (GK20A): GFLOPS and active warps per active cycle versus matrix size (64 to 4096), for a small and a large kernel]
  • 9. Preprocessing for Convolution on GPU
    - To use the GPU's fast matrix multiplication library, the input data is usually duplicated first (the im2col transformation; see the sketch below)
      - Sub-regions of the input are converted into a single large matrix
    - Faster than naive convolution, but the duplication itself is still a performance overhead
    [Figure: each k x k (k = 3) window of the n x n input, from position [0,0] up to [n-3, n-3], is copied into one column of a temporary k^2 x (n-2)^2 array used for the matrix multiplication]
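A minimal C sketch of this duplication step, assuming a single-channel n x n input stored row-major and a k x k window with stride 1; the function and array names are illustrative, not taken from the paper:

    /* Copy every k x k window of the n x n row-major input `in` into one
     * column of the (k*k) x (n-k+1)^2 matrix `col`, so that the whole
     * convolution becomes a single matrix-matrix multiplication.
     * Illustrative helper, not code from the paper. */
    static void im2col(const float *in, int n, int k, float *col)
    {
        int out = n - k + 1;                  /* output side length       */
        for (int y = 0; y < out; y++) {
            for (int x = 0; x < out; x++) {
                int c = y * out + x;          /* destination column index */
                for (int ky = 0; ky < k; ky++)
                    for (int kx = 0; kx < k; kx++)
                        col[(ky * k + kx) * (out * out) + c] =
                            in[(y + ky) * n + (x + kx)];
            }
        }
    }

Each input element is copied up to k^2 times, which is exactly the extra memory traffic the slide calls a performance overhead.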
  • 10. Our approach: EMAX (Energy-aware Multi-mode Accelerator eXtension)
    - A CGRA of local-memory-based PEs connected by several buses
      - Each PE has a local memory for data locality
    [Figure: EMAX attached to a CPU core: an array of PEs behind a memory interface, connected with the CPU core and DRAM through the interconnection]
  • 11. Real Chip of EMAX
    - 12.5 mm x 12.5 mm in a 180 nm technology
  • 12-18. Processing Element (PE)
    - Local memory on each PE for efficient data locality and memory bandwidth utilization
    [Figure: PE block diagram with two execution units (EX1, EX2), an address generator (EAG), a local memory (LMM) with DIN/ADDR/DOUT ports, an EX FIFO and an LMM FIFO, six constant registers, and connections to the memory bus and to the internal and external shuffle buses]
    - Slides 13-18 highlight these blocks in turn: the LMM (local memory), the FIFOs, the execution units, the constant registers, the internal and external shuffle buses, and the memory bus (a C record modeling these components is sketched below)
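A hedged C sketch of the per-PE state listed on these slides: two execution units, an address generator, a local memory, two FIFOs, and six constant registers. The sizes, types, and field names are illustrative assumptions, not taken from the EMAX specification:

    #include <stdint.h>

    #define LMM_WORDS   1024   /* assumed local-memory depth in words      */
    #define FIFO_DEPTH  4      /* assumed FIFO depth                       */
    #define NUM_CONST   6      /* six constant registers, as in the figure */

    /* One EMAX processing element as described on slides 12-18. */
    typedef struct {
        uint32_t lmm[LMM_WORDS];        /* LMM: local memory (DIN/ADDR/DOUT)  */
        uint32_t ex_fifo[FIFO_DEPTH];   /* EX FIFO: forwards operands         */
        uint32_t lmm_fifo[FIFO_DEPTH];  /* LMM FIFO: forwards loaded data     */
        uint32_t cnst[NUM_CONST];       /* constant registers (e.g. weights)  */
        uint32_t ex1_out, ex2_out;      /* latest results of EX1 and EX2      */
        uint32_t eag_addr;              /* address produced by the EAG        */
    } emax_pe_t;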
  • 19. EMAX instruction
    (a) Instruction format:
      - Type1: row#, col#, dist [count]  ALU_OP & MEM_OP  RGI  LMM_CONTROL
      - Type2: row#, col#, dist [count]  ALU_OP
      - Type3: row#, col#, dist [count]  & MEM_OP  RGI  LMM_CONTROL
    (b) EX1 operations:
      - 32-bit: add/add3/sub/sub3
      - 16-bit x 2: mauh/mauh3/msuh3
      - Misc: mulh/mmrg3/msad/minl/minl3/mh2bw/mcas/mmid3/mmax/mmax3/mmin/mmin3
      - Load from EX_FIFO: ldb/ldub/ldh/lhuh/ld
      - Floating point: fmul/fma3/fadd
    (c) EX2 operations:
      - 32-bit: and/or/xor
      - 16-bit x 2: mauh/mauh3/msuh3
    (d) LMM operations:
      - Load from LMM or LMM_FIFO: ldb/ldub/ldh/lhuh/ld
      - Store to LMM: stb/sth/st/cst
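A hedged C sketch of how the three instruction formats of slide 19 might be held in a host-side data structure. The field names follow the slide; the widths, types, and which fields are meaningful for each type are assumptions for illustration only:

    #include <stdint.h>

    typedef enum { EMAX_TYPE1, EMAX_TYPE2, EMAX_TYPE3 } emax_itype_t;

    /* One EMAX instruction: placement (row#, col#, dist, optional count)
     * plus the ALU, memory, and LMM-control fields.  Per slide 19, Type2
     * carries only ALU_OP, and Type3 only the memory-side fields.
     * Field widths are assumed, not specified by the slides. */
    typedef struct {
        emax_itype_t type;
        uint8_t  row, col;     /* target PE position (row#, col#)        */
        uint8_t  dist;         /* dist field of the format               */
        uint16_t count;        /* optional [count]                       */
        uint16_t alu_op;       /* e.g. add, fmul, fma3 (Type1/Type2)     */
        uint16_t mem_op;       /* e.g. ld, st, cst     (Type1/Type3)     */
        uint16_t rgi;          /* RGI field            (Type1/Type3)     */
        uint16_t lmm_control;  /* LMM_CONTROL field    (Type1/Type3)     */
    } emax_insn_t;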
  • 20-22. Forward propagation
    - The weight matrix is constant in the innermost loops
      - Assigned to the constant registers
    - The index of In increases linearly
      - Burst bulk transfer from the external memory (see the restructured inner loop sketched below)
    - Loop nest (the slide marks parts of this nest as "operations per activation of EMAX" and "operations per clock cycle on EMAX"):

        for (i1 = 0; i1 < InDim; i1++) {
          for (j1 = 0; j1 < (Nimg - Nk + 1); j1++) {
            for (i2 = 0; i2 < OutDim; i2++) {
              for (j2 = 0; j2 < (Nimg - Nk + 1) * Nbatch; j2++) {
                for (ky = 0; ky < Nk; ky++) {
                  for (kx = 0; kx < Nk; kx++) {
                    Out[i2][j1][j2] += Weight[i1][i2][ky][kx] * In[i1][j1 + ky][j2 + kx];
                  }
                }
              }
            }
          }
        }

      InDim: dimension of the input data, OutDim: dimension of the output data
      Nimg: side length of the input data, Nbatch: batch size, Nk: convolution window size
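To see why this nest maps well onto EMAX, here is a hedged restructuring of the inner loops for Nk = 3: for a fixed (i1, i2, j1) the nine weights are loop-invariant, so they can live in constant registers, and In is read with unit stride, which suits burst transfers. The function name and signature are illustrative, not the paper's code:

    /* Inner loops of the forward propagation for Nk == 3, with the weights
     * hoisted out of the j2 loop.  The local copies w00..w22 model the
     * constant registers; r0, r1, r2 point at three consecutive input rows
     * and are read with unit stride, like a burst transfer. */
    void conv_row_3x3(float *out, const float *r0, const float *r1,
                      const float *r2, const float w[3][3], int width)
    {
        const float w00 = w[0][0], w01 = w[0][1], w02 = w[0][2];
        const float w10 = w[1][0], w11 = w[1][1], w12 = w[1][2];
        const float w20 = w[2][0], w21 = w[2][1], w22 = w[2][2];

        for (int j2 = 0; j2 < width; j2++) {
            out[j2] += w00 * r0[j2] + w01 * r0[j2 + 1] + w02 * r0[j2 + 2]
                     + w10 * r1[j2] + w11 * r1[j2 + 1] + w12 * r1[j2 + 2]
                     + w20 * r2[j2] + w21 * r2[j2 + 1] + w22 * r2[j2 + 2];
        }
    }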
  • 23-29. CNN on EMAX (3x3 convolution)
    [Figure: mapping of a 3x3 convolution onto the PE array (rows 0-3, columns 0-7). The weight values w[0][0]..w[2][2] are held in constant registers; the input rows in[i-1][], in[i][], and in[i+1][] are stored in LMMs; a chain of FMUL and FMA units forms the partial products, and a final FADD adds the partial sum out[i][j] loaded from LMM; the result is stored to the LMM of the next stage. In parallel, the next input row in[i+2][] and the next output row out[i+1][] are preloaded from memory, and the previous result is drained to memory, under loop control through the memory interface.]
    - Slides 24-29 highlight the mapping step by step (a host-side sketch of the same dataflow follows below):
      - the 3x3 weight matrix placed in constant registers
      - the three input data sets kept in LMMs, with the same read data forwarded to neighboring units via FIFOs
      - each stage reading from its constant register, its LMM, and the execution unit of the previous stage, and passing its result to the next stage
      - the final result stored into the LMM of the next stage
      - the previous data written back to the main memory
      - the next input data loaded from the main memory
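A hedged host-side view of the same dataflow, reusing conv_row_3x3 from the previous sketch: for each output row, the kernel computes from rows already resident in buffers that stand in for the LMMs, while the next input row is preloaded and the finished row is drained back to memory. On EMAX these steps overlap; in this plain C model they run sequentially. The buffer sizes and names (lmm_in, lmm_out, MAX_W) are illustrative assumptions:

    #include <string.h>

    #define MAX_W 4096                        /* assumed maximum row length */

    void conv_row_3x3(float *out, const float *r0, const float *r1,
                      const float *r2, const float w[3][3], int width);

    /* One input plane of `height` rows, each (width + 2) elements long,
     * producing (height - 2) output rows of length `width`.  Input rows are
     * staged through a 4-slot circular buffer that stands in for the LMMs:
     * before computing output row i we "preload" input row i+2 and the
     * partial sums out[i][], and afterwards we "drain" the finished row;
     * EMAX overlaps these preload/drain steps with the computation. */
    void conv_plane_3x3(float *out, const float *in, const float w[3][3],
                        int height, int width)
    {
        static float lmm_in[4][MAX_W + 2];    /* staged input rows  */
        static float lmm_out[MAX_W];          /* staged output row  */
        int stride = width + 2;

        memcpy(lmm_in[0], in + 0 * stride, stride * sizeof(float));
        memcpy(lmm_in[1], in + 1 * stride, stride * sizeof(float));

        for (int i = 0; i + 2 < height; i++) {
            /* preload: next input row and the current partial output row */
            memcpy(lmm_in[(i + 2) % 4], in + (i + 2) * stride,
                   stride * sizeof(float));
            memcpy(lmm_out, out + i * width, width * sizeof(float));

            conv_row_3x3(lmm_out, lmm_in[i % 4], lmm_in[(i + 1) % 4],
                         lmm_in[(i + 2) % 4], w, width);

            /* drain: write the finished row back to main memory */
            memcpy(out + i * width, lmm_out, width * sizeof(float));
        }
    }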
  • 30. Evaluation setup
    - Benchmarks: deep learning datasets and networks
      - ImageNet (Alexnet-2), CIFAR-10, MNIST (Lenet)
    - Hardware:
      - CPU (Core i7, ARM), GPU (desktop, mobile), EMAX
    - Metrics: performance per memory bandwidth and performance per area
      - Estimated from the actual EMAX LSI and from software simulations
  • 31-33. Performance per memory bandwidth
    - EMAX achieves better performance per memory bandwidth on the embedded-class datasets (a back-of-the-envelope model of this metric is sketched below)
    [Figure: Operations/Byte (0 to 18) for EMAX, GTX980, GK20A, Core i7, and ARM on Alexnet-2, CIFAR10-1/2/3, CIFAR10 (avg), Lenet-1, Lenet-2, and Lenet (avg)]
    - Alexnet: since the matrix size is large, the desktop GPU is 3.17x better
    - CIFAR-10: EMAX is 1.41x better than the mobile GPU; Lenet: 1.75x better than the mobile GPU
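A hedged sketch of how an operations-per-byte figure for one convolution layer can be estimated, assuming two floating-point operations per multiply-accumulate, 4-byte values, and that every input, weight, and output element crosses the memory interface exactly once. The paper's numbers reflect the measured traffic of each platform, so this is only a back-of-the-envelope model:

    /* Rough arithmetic intensity of one convolution layer; the parameters
     * have the same meaning as in the loop nest of slides 20-22.
     * Assumes single-precision data and ideal, one-pass memory traffic. */
    double conv_ops_per_byte(int InDim, int OutDim, int Nimg, int Nk, int Nbatch)
    {
        double out_pixels = (double)(Nimg - Nk + 1) * (Nimg - Nk + 1) * Nbatch;
        double flops = 2.0 * InDim * OutDim * Nk * Nk * out_pixels;
        double bytes = 4.0 * ((double)InDim * Nimg * Nimg * Nbatch   /* inputs  */
                            + (double)InDim * OutDim * Nk * Nk       /* weights */
                            + (double)OutDim * out_pixels);          /* outputs */
        return flops / bytes;
    }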
  • 34-36. Performance per area
    - EMAX achieves much better performance per area on the embedded-class datasets: is a CGRA the better choice for embedded systems?
    [Figure: area performance, AreaPerf [FLOPS/Tr] (0 to 800), for EMAX, GTX980, and Core i7 on Alexnet-2, CIFAR10-1/2/3, CIFAR10 (avg), Lenet-1, Lenet-2, and Lenet (avg)]
    - Alexnet: since the matrix size is large, the desktop GPU is 2.2x better
    - CIFAR-10: EMAX is 1.76x better than the mobile GPU; Lenet: 1.95x better than the mobile GPU
  • 37. Conclusion
    - A CGRA-based acceleration approach for convolutional neural networks (CNNs) on embedded accelerators
      - EMAX (Energy-aware Multi-mode Accelerator eXtension)
    - EMAX outperforms the GPU on embedded-class datasets
      - 1.75x better performance per memory bandwidth
      - 1.95x better performance per area (~ energy)
    [Figure: EMAX block diagram: a CPU core and an array of PEs behind a memory interface, connected to DRAM through the interconnection]