New Pointwise Convolution in Deep Neural Networks through Extremely Fast and Non-parametric Transforms
Joonhyun Jeong | Sung-Ho Bae
Kyung-Hee University
INDEX
01 Background & Motivation
02 Method
03 Conclusion
Background & Motivation
Background
Standard convolution kernels
Spatial specific convolution kernels
Channel specific convolution (pointwise convolution) kernels
Depthwise separable convolution
* notations
N = number of output channels
M = number of input channels
Dk = filter (kernel) size
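With these notations, the standard depthwise-separable cost decomposition (stated here for context, not text from the slide) is: a standard convolution layer needs Dk x Dk x M x N weights, while a depthwise separable layer splits this into Dk x Dk x M (depthwise) + M x N (pointwise), so the pointwise part dominates whenever N is large.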
Motivation
• But existing pointwise convolution needs a lot of parameters and FLOPs!
• In MobileNet-V1, pointwise convolution accounts for roughly 75% of the parameters and 95% of the FLOPs, with all other layers making up the remaining 25% and 5% (pie charts: "Ratio of params in MobileNet-V1", "Ratio of FLOPs in MobileNet-V1").
• This study focuses on reducing the number of weights and FLOPs needed in pointwise convolution through conventional transforms (a rough cost sketch follows below).
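A minimal back-of-the-envelope sketch (ours, not from the slides; the layer sizes are hypothetical) of why the 1x1 projection dominates the cost:

```python
# Counting parameters and FLOPs of a standard pointwise (1x1) convolution layer.
def pointwise_conv_cost(in_channels: int, out_channels: int, height: int, width: int):
    """Return (params, flops) of a 1x1 convolution, ignoring bias."""
    params = in_channels * out_channels                   # one weight per input/output channel pair
    flops = in_channels * out_channels * height * width   # multiply-accumulates at every spatial position
    return params, flops

# Example: a hypothetical late-stage MobileNet-V1-like layer.
print(pointwise_conv_cost(in_channels=512, out_channels=512, height=14, width=14))
# -> (262144, 51380224): the 1x1 projection alone costs ~0.26M weights and ~51M FLOPs.
```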
Method
• Pointwise Convolution with conventional transforms
• Optimal block structure for conventional transforms
• Optimal hierarchical block levels for conventional transforms
Pointwise Convolution (PC) using conventional transforms
Discrete Cosine Transform (DCT) kernels
Discrete Walsh-Hadamard Transform (DWHT) kernels
Conventional pointwise convolution kernels
(figure: the first, second, third, ... kernels of each type, visualized side by side)
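For concreteness, a minimal NumPy sketch (our illustration, not the authors' code) of how fixed DCT-II and Walsh-Hadamard basis rows can be generated to stand in for learned pointwise kernels:

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis; row k is the k-th cosine kernel."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    basis = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    basis[0] /= np.sqrt(2.0)              # DC row gets the usual 1/sqrt(2) scaling
    return basis

def dwht_matrix(n: int) -> np.ndarray:
    """Walsh-Hadamard basis (n must be a power of two); entries are only +1/-1."""
    h = np.array([[1.0]])
    while h.shape[0] < n:                 # Sylvester construction
        h = np.block([[h, h], [h, -h]])
    return h

# Used as pointwise kernels: output channel k of the transform is just basis row k
# applied across the input channels, so nothing has to be stored or learned.
```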
• Two nice properties of conventional transforms
➢ No learnable parameters needed: no MAC (Memory Access Cost) for weight parameters.
➢ Fast computation versions exist: complexity drops from O(N²) to O(N log N).
• Thus we can construct an efficient neural network in terms of both memory footprint and computation speed.
• Fast version of the Discrete Walsh-Hadamard Transform (butterfly sketch below)
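The butterfly below is a standard fast-DWHT sketch (ours, assuming a power-of-two channel count), illustrating why the transform needs only additions and subtractions:

```python
import numpy as np

def fast_dwht(x: np.ndarray) -> np.ndarray:
    """Walsh-Hadamard transform of x in O(N log N); len(x) must be a power of two."""
    y = x.astype(np.float64).copy()
    h = 1
    while h < len(y):
        for i in range(0, len(y), 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b   # butterfly step: no multiplications
        h *= 2
    return y

# Applied as a pointwise convolution, this runs once per spatial position over
# the channel vector, replacing the learned 1x1 kernel entirely.
```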
• Fast version of the Discrete Cosine Transform: adopted Kok's fast DCT algorithm (1997).
No multiplications are needed for DWHT (only additions and subtractions)!
DWHT and DCT can be extremely fast!
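Not Kok's 1997 algorithm itself; as an illustration of the speed class, a library stand-in (SciPy's FFT-based DCT-II) that also runs in O(N log N) over the channel vector at each spatial position:

```python
import numpy as np
from scipy.fft import dct

channel_vec = np.random.randn(64)                # the channel vector at one spatial position
coeffs = dct(channel_vec, type=2, norm='ortho')  # FFT-based DCT-II across channels, O(N log N)
print(coeffs.shape)                              # (64,) "output channels" of the fixed transform
```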
Optimal block structure for conventional transforms
Baseline: ShuffleNet-V2
* notations
CTPC = Conventional Transform Pointwise Convolution
(block diagrams compare placing ReLU after CTPC vs. no ReLU after CTPC)
➤ Applying ReLU after the conventional transform degraded accuracy significantly (a layer sketch follows the notations below).
* notations
(b)-DCT : CTPC in block (b) is DCT
(b)-DWHT : CTPC in block (b) is DWHT
(c)-DCT : CTPC in block (c) is DCT
(c)-DWHT : CTPC in block (c) is DWHT
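A hedged PyTorch sketch (our illustration, not the authors' released code): a CTPC layer realized as a 1x1 convolution whose weight is a fixed Walsh-Hadamard basis, and, per the finding above, with no ReLU applied after it. Normalization of the basis is omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WalshHadamardPointwise(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        assert (channels & (channels - 1)) == 0, "channel count must be a power of two"
        h = torch.tensor([[1.0]])
        while h.shape[0] < channels:                        # Sylvester construction of the basis
            h = torch.cat([torch.cat([h, h], dim=1),
                           torch.cat([h, -h], dim=1)], dim=0)
        # Registered as a buffer: saved with the model, never trained.
        self.register_buffer("weight", h.view(channels, channels, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Equivalent to a 1x1 convolution with a fixed, non-learnable kernel and no bias.
        return F.conv2d(x, self.weight)

x = torch.randn(1, 64, 14, 14)
y = WalshHadamardPointwise(64)(x)   # same spatial shape, zero learnable parameters, no ReLU after
```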
Optimal hierarchical block levels for conventional transforms (ShuffleNet-V2)
* notations
(a) : baseline block
(highlighted range) : (a) blocks in this range are all replaced by our optimal block
(figure: nine ShuffleNet-V2 variants, i.e. Low-level, Mid-level, and High-level models 1, 2, and 3, differing in which range of blocks is replaced)
➤ High-level blocks are favored by the proposed pointwise convolution layer.
(figure: results for high-level block, middle-level block, and low-level block replacement)
Optimal hierarchical block levels for conventional transforms (MobileNet-V1)
Compared to the baseline model: 1.49% accuracy increase, 79.1% of weights reduced, and 48.4% of FLOPs reduced!
Conclusion
• We proposed an extremely fast and non-parametric pointwise convolution!
• In particular, DWHT is extremely efficient in computation because it requires no multiplications, only additions and subtractions.
• We found the optimal block structure and hierarchical block levels for conventional transforms.
THANK YOU
Joonhyun Jeong | Sung-Ho Bae
Kyung-Hee University
