1-bit semantic segmentation

Sharing my experience developing a side project!
//Neural Network Quantization & Inference Acceleration
//Project. Drop the bit
김정훈 (Jeounghoon Kim), AI Robotics KR
https://www.facebook.com/jeounghoon.kim.5
KakaoTalk: @SamuelKJH

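Speaker: 김정훈
. M.S. in Electrical and Electronic Engineering, Korea University
. Researcher in control and robot systems
. Deep learning engineer (technical research personnel)
. Interests: State Estimation & Neural Network
2019 activities:
. Speaker, Google Developer Group Gwangju DevFest 2019
. Samsung Electronics seminar: State Estimation with Probabilistic Data Association & Multi-Object Tracking (scheduled for 2019.11.26)
. Advisory committee member, Online Lifelong Education Institute, Korea University of Technology and Education
. Reviewer, Samsung Open Source Conference (SOSCON) 2019
. MathWorks Advisory Board 2019
. Study group leader, Neural Network Quantization & Compact Network Design (please like and subscribe...)
. Organizer, AI Robotics KR
I started community activities to find people to research and talk with!
Today's talk: Neural Network Quantization & Inference Acceleration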

Project. Drop the bit
Weight & Bias
FP32 → 1 Bit
Small Computing Device!
On-device AI!
Neural Network: Heavy & Power Hungry

Project. Drop the bit
• Goal: realize on-device AI by combining neural network model compression with hardware acceleration
• Participants: 김정훈 (SW, Korea University), 김현우 (HW, Hanyang University)
• Approach: SW & HW collaboration
• Deliverable: an on-device AI implementation
Stage 1: Lightweight end-to-end semantic segmentation
Many thanks to HW Kim..

Processing Environments
Jetson Nano
Raspberry Pi
PC/CPU/GPU
ASIC
FPGA
Computing Environments:
CPU, GPU,
Arm Processors,
Neural Processing Unit(NPU),
Dedicated Hardware,
…

Neural Network Quantization
FP32 → Lower Bit
Weight & Bias
Lower Volume, Computation Power, Memory Access & Usage,…
Model Compression!
&& Acceleration
(provided plenty of supporting pieces are in place)

Wu, Shuang, et al. "Training and inference with integers in deep neural networks." ICLR2018, arXiv:1802.04680 (2018).
If you need a conceptual introduction to 8-bit quantization, see:
https://kr.mathworks.com/company/newsletters/articles/what-is-int8-quantization-and-why-is-it-popular-for-deep-neural-networks.html?s_v1=29204&elqem=2890150_EM_KR_19-11_NEWSLETTER_CG-DIGEST&elqTrackId=a602b18751024fa5a27b1359e4a129a4&elq=f393de06767f4f3ba8bde323d1cf7176&elqaid=29204&elqat=1&elqCampaignId=10302&fbclid=IwAR3MQomYaE8RmG5CICO6ZJ5rP_NXyFfjr18grV82jOA5CLvjXRcbGSC_igE
This talk is strictly about the network's Weight & Bias && Activation

Workflow
Stages: Architecture Design, Data Analysis & Pre-processing, Model Training, Network Analysis (t-SNE Spaces, Weight Histogram), Post-Processing

These features are actually truly precious... the very features I used to use without a second thought...

Network Parameter Comparison – DeepLabV3+
Total Weight Count: DeepLabV3+ (Baseline: ResNet18)

Width  Height  Input Channel  Output Channel  Total
7      7       3              64              9408
3      3       64             64              36864
3      3       64             64              36864
3      3       64             64              36864
3      3       64             64              36864
1      1       64             48              3072
1      1       64             128             8192
3      3       64             128             73728
3      3       128            128             147456
3      3       128            128             147456
3      3       128            128             147456
3      3       128            256             294912
1      1       128            256             32768
3      3       256            256             589824
3      3       256            256             589824
3      3       256            256             589824
1      1       256            512             131072
3      3       256            512             1179648
3      3       512            512             2359296
3      3       512            512             2359296
3      3       512            512             2359296
3      3       512            256             1179648
3      3       512            256             1179648
3      3       512            256             1179648
1      1       512            256             131072
1      1       1024           256             262144
8      8       256            256             4194304
3      3       304            256             700416
3      3       256            256             589824
1      1       256            11              2816
8      8       11             11              7744

FP32 vs 1 Bit

Architecture Weight Count
       bit        byte      MB
FP32   659111936  82388992  82.388992
1bit   20597248   2574656   2.574656

Maximum Activation (Initial Input Size: 360 x 480 x 3)
       bit        byte      MB
FP32   105062400  13132800  13.1328
1bit   3283200    410400    0.4104
Obviously, 32-bit and 1-bit differ by 32x in volume.
What if we quantized DeepLabV3+ to 1 bit?
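The weight totals on this slide can be reproduced from the layer table. Below is a minimal sketch in plain Python, assuming each row is (kernel width, kernel height, input channels, output channels) and that only convolution weights are counted (no biases), which is what the table lists. The names `LAYERS`, `weight_count`, and `size_bytes` are illustrative, not from the project code.

```python
# Layer shapes copied from the DeepLabV3+ (ResNet18) table:
# (kernel_w, kernel_h, in_channels, out_channels)
LAYERS = [
    (7, 7, 3, 64),
    (3, 3, 64, 64), (3, 3, 64, 64), (3, 3, 64, 64), (3, 3, 64, 64),
    (1, 1, 64, 48), (1, 1, 64, 128),
    (3, 3, 64, 128), (3, 3, 128, 128), (3, 3, 128, 128), (3, 3, 128, 128),
    (3, 3, 128, 256), (1, 1, 128, 256),
    (3, 3, 256, 256), (3, 3, 256, 256), (3, 3, 256, 256),
    (1, 1, 256, 512), (3, 3, 256, 512),
    (3, 3, 512, 512), (3, 3, 512, 512), (3, 3, 512, 512),
    (3, 3, 512, 256), (3, 3, 512, 256), (3, 3, 512, 256),
    (1, 1, 512, 256), (1, 1, 1024, 256),
    (8, 8, 256, 256), (3, 3, 304, 256), (3, 3, 256, 256),
    (1, 1, 256, 11), (8, 8, 11, 11),
]


def weight_count(layers):
    """Sum of kernel_w * kernel_h * in_channels * out_channels over all layers."""
    return sum(w * h * c_in * c_out for w, h, c_in, c_out in layers)


def size_bytes(num_values, bits_per_value):
    """Storage needed for num_values values at the given bit width."""
    return num_values * bits_per_value // 8


total = weight_count(LAYERS)
fp32_bytes = size_bytes(total, 32)
one_bit_bytes = size_bytes(total, 1)
print(total, fp32_bytes, one_bit_bytes)
```

The ratio `fp32_bytes // one_bit_bytes` is exactly 32, which is the volume gap the slide points out.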

Network Parameter Comparison – XnorNet
Rastegari, Mohammad, et al. "Xnor-net: Imagenet classification using binary convolutional neural networks." European Conference on Computer Vision. Springer, Cham, 2016.
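For readers new to XNOR-Net, the core idea can be sketched in a few lines: a real-valued filter W is approximated by a scale alpha = mean(|W|) times sign(W), and a dot product between two {-1, +1} vectors reduces to an XNOR followed by a popcount. The sketch below is plain Python under those assumptions; the function names are hypothetical and this is not the project's kernel.

```python
def binarize(weights):
    """Return (alpha, signs): alpha = mean(|w|), signs in {-1, +1} (zero maps to +1)."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs


def pack_bits(signs):
    """Pack {-1, +1} values into an int; bit i is set when signs[i] == +1."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word


def xnor_popcount_dot(a_bits, b_bits, n):
    """Dot product of two length-n {-1, +1} vectors from their packed bits."""
    mask = (1 << n) - 1
    matches = bin(~(a_bits ^ b_bits) & mask).count("1")  # XNOR, then popcount
    return 2 * matches - n  # each match adds +1, each mismatch adds -1


weights = [0.4, -0.2, 0.1, -0.7]
inputs = [1, -1, -1, -1]  # activations that are already binarized
alpha, w_signs = binarize(weights)
dot = xnor_popcount_dot(pack_bits(w_signs), pack_bits(inputs), len(weights))
reference = sum(w * x for w, x in zip(w_signs, inputs))
approx_output = alpha * dot  # scaled binary result approximating the FP32 dot product
```

Here `dot` and `reference` agree, which is the property that lets hardware replace multiply-accumulate units with bitwise logic.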

Building Binarized Neural Network Trainer!
• Courbariaux, Matthieu, et al. "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1." arXiv preprint arXiv:1602.02830 (2016).
• Rastegari, Mohammad, et al. "Xnor-net: Imagenet classification using binary convolutional neural networks." European Conference on Computer Vision. Springer, Cham, 2016.
• Darabi, Sajad, et al. "BNN+: Improved binary network training." arXiv preprint arXiv:1812.11800 (2018).
• Liu, Zechun, et al. "Bi-real net: Enhancing the performance of 1-bit cnns with improved representational capability and advanced training algorithm." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
• Zhou, Shuchang, et al. "Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients." arXiv preprint arXiv:1606.06160 (2016).
• Jung, Sangil, et al. "Learning to quantize deep networks by optimizing quantization intervals with task loss." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.

Code Example (PyTorch)
Straight Through Estimator for Gradient Propagation

import torch

class SignumActivation(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        # Mean of |x| over the channel dimension, returned alongside the binary output
        mean = torch.mean(input.abs(), 1, keepdim=True)
        # sign(0) is 0, so the 0.01 offset maps exact zeros to +1
        output = input.sign().add(0.01).sign()
        return output, mean

    @staticmethod
    def backward(ctx, grad_output, grad_output_mean):  # STE part
        input, = ctx.saved_tensors
        # Smooth surrogate gradient (2 / cosh(x))^2 in place of the hard clip below
        grad_input = (2 / torch.cosh(input)) ** 2 * grad_output
        # grad_input[input.ge(1)] = 0   # greater or equal
        # grad_input[input.le(-1)] = 0  # less or equal
        return grad_input
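Because sign(x) has zero derivative almost everywhere, the backward pass has to substitute a surrogate. The sketch below compares two options in plain Python: the hard-clipped STE that appears as commented-out lines in the slide's code, and the (2 / cosh(x))^2 surrogate the slide actually uses. It is a scalar illustration with hypothetical function names, not a replacement for the autograd function.

```python
import math


def signum(x):
    """Forward pass: map to {-1, +1}, sending 0 to +1 like sign(x).add(0.01).sign()."""
    return 1.0 if x >= 0 else -1.0


def clipped_ste_grad(x, upstream):
    """Hard-clipped STE: pass the gradient through only where |x| < 1."""
    return upstream if abs(x) < 1 else 0.0


def sech2_grad(x, upstream):
    """Smooth surrogate from the slide: (2 / cosh(x))^2 * upstream."""
    return (2.0 / math.cosh(x)) ** 2 * upstream


for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(x, signum(x), clipped_ste_grad(x, 1.0), round(sech2_grad(x, 1.0), 4))
```

The smooth surrogate peaks at x = 0 and decays without a hard cutoff, so inputs far from zero still receive a small gradient instead of none.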

Code Example (PyTorch)

import torch.nn as nn

class BinConv2d(nn.Conv2d):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def forward(self, input):
        # Keep a full-precision (FP32) copy of the weights on first use
        if not hasattr(self.weight, 'fp'):
            self.weight.fp = self.weight.data.clone()
        # Binarize to {-1, +1}; the 0.01 offset maps exact zeros to +1
        self.weight.data = self.weight.fp.sign().add(0.01).sign()
        out = nn.functional.conv2d(input, self.weight, None, self.stride,
                                   self.padding, self.dilation, self.groups)
        if self.bias is not None:
            self.bias.fp = self.bias.data.clone()
            out += self.bias.view(1, -1, 1, 1).expand_as(out)
        return out

FP32 Weight && Quantized Weight → Optimization for Minimum Quantization Error
→ In the end, quantization is about how to handle the gradients and the weights
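The layer above keeps two copies of each weight: a full-precision one and a binarized one. A common BNN training pattern (as in Courbariaux et al.) is to run the forward pass with the binarized weights, apply the gradient to the full-precision copy, and binarize again. The plain Python sketch below illustrates that loop for three weights. The SGD rule, learning rate, gradients, and the clamp to [-1, 1] are assumptions for illustration, not details taken from this project.

```python
def binarize(w):
    """Map a latent FP weight to {-1, +1}, sending 0 to +1."""
    return 1.0 if w >= 0 else -1.0


def sgd_step(latent, grads, lr=0.1):
    """Update FP latent weights with gradients computed on the binarized weights."""
    updated = []
    for w, g in zip(latent, grads):
        w = w - lr * g
        w = max(-1.0, min(1.0, w))  # assumption: clamp latent weights to [-1, 1]
        updated.append(w)
    return updated


latent = [0.05, -0.30, 0.80]
binary_before = [binarize(w) for w in latent]
grads = [1.0, -1.0, -5.0]  # hypothetical gradients w.r.t. the binarized weights
latent = sgd_step(latent, grads)
binary_after = [binarize(w) for w in latent]
print(binary_before, binary_after)
```

Only the first weight flips sign: it sat close to zero, so a small update moved it across the threshold. The other two keep their binary value even though their latent values changed.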

BNN Performance Comparison
Nurvitadhi, Eriko, et al. "Accelerating binarized neural networks: Comparison of FPGA, CPU, GPU, and ASIC." 2016 International Conference on Field-Programmable Technology (FPT). IEEE, 2016.

Architecture Study – 1. SegNet
SegNet Architecture (Baseline: VGG16)
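Original unit: Convolution → Batch Normalization → Activation, followed by Pool
Modified unit: Convolution → Batch Normalization, followed by Pool → Activation
Pooling right after the Signum activation loses a lot of information, so the activation is placed after the pool.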

Architecture Study – 1. SegNet
        Global Accuracy  Mean Accuracy  Mean IoU     Weighted IoU  Mean BF Score
FP32    0.936334255      0.804027548    0.71979825   0.885348723   0.788193434
1-bit   0.924336215      0.770357729    0.678117835  0.865555414   0.723602791
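For reference, four of the five metrics in this table can be computed from a confusion matrix. The sketch below is plain Python and assumes the standard definitions: global accuracy is correct pixels over all pixels, per-class accuracy is TP / (TP + FN), IoU is TP / (TP + FP + FN), and weighted IoU weights each class IoU by its ground-truth pixel count. Mean BF Score is left out because it needs boundary matching on the label images. The function name and toy matrix are hypothetical.

```python
def segmentation_metrics(conf):
    """conf[i][j] = number of pixels with ground-truth class i predicted as class j.

    Assumes every class has at least one ground-truth pixel.
    """
    n = len(conf)
    total = sum(sum(row) for row in conf)
    tp = [conf[i][i] for i in range(n)]
    gt = [sum(conf[i]) for i in range(n)]                         # TP + FN
    pred = [sum(conf[i][j] for i in range(n)) for j in range(n)]  # TP + FP
    acc = [tp[i] / gt[i] for i in range(n)]
    iou = [tp[i] / (gt[i] + pred[i] - tp[i]) for i in range(n)]
    return {
        "GlobalAccuracy": sum(tp) / total,
        "MeanAccuracy": sum(acc) / n,
        "MeanIoU": sum(iou) / n,
        "WeightedIoU": sum(gt[i] * iou[i] for i in range(n)) / total,
    }


toy = [[5, 1],
       [2, 2]]
metrics = segmentation_metrics(toy)
print(metrics)
```

Mean IoU treats every class equally, so it drops faster than global accuracy when small classes are segmented poorly.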

Architecture Study – 2. DeepLabV3+
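Difficulties & problems
1. Many residual connections → poor memory utilization in hardware
2. Heavy downsampling → BNNs are vulnerable to it
3. The network structure depends on feature precision
4. Binary features seem insufficient for features where detail matters, such as dilated convolutions
5. I tried hard to modify and train it, but could not reach usable accuracy... (T_T)
Looking for someone who can give architecture advice....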

Segmentation Network Architecture Setting!
Hardware for the network architecture?
Or a network architecture for the hardware?
Trade-off!!

Segmentation Network Architecture Setting!
Unit: Convolution or Transposed Convolution → Batch Normalization → Signum Activation
Since this is our first hardware attempt, sacrifice the architecture for the hardware! → Start with a simple structure first!
Looking for architecture advice, part 2....
Encoding/Decoding structure: 11 units (Unit1 to Unit11)
Feature map sizes along the network: 360x480x3 (input) → 360x480x64 → 180x240x128 → 90x120x256 → 180x240x128 → 360x480x64 → 360x480x11 (output)

Segmentation Result Comparison – SegNet
             FP32 SegNet  Ours
Iteration    80k>         60k
Sky          0.896        0.940
Building     0.834        0.649
Pole         0.961        0.782
Road         0.877        0.904
Pavement     0.527        0.891
Tree         0.964        0.823
Sign Symbol  0.622        0.804
Fence        0.5345       0.750
Car          0.321        0.826
Pedestrian   0.933        0.780
Bicyclist    0.365        0.723
mIoU         0.6010       0.5474
BF           0.4684       0.5478

Ours, overall:
GlobalAccuracy  MeanAccuracy  MeanIoU  WeightedIoU  MeanBFScore
0.82310         0.80665       0.54736  0.74198      0.54778

Ours, per class:
             Accuracy  IoU       MeanBFScore
Sky          0.94006   0.89583   0.88903
Building     0.64936   0.62527   0.48987
Pole         0.78200   0.17563   0.45098
Road         0.90385   0.88358   0.66327
Pavement     0.89108   0.642277  0.59924
Tree         0.82300   0.721303  0.62532
Sign Symbol  0.80385   0.22311   0.31997
Fence        0.75018   0.43108   0.43870
Car          0.82649   0.646385  0.54331
Pedestrian   0.77986   0.274434  0.44095
Bicyclist    0.72339   0.502063  0.47152
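As a consistency check, the MeanAccuracy and MeanIoU reported for "Ours" are the unweighted averages of the per-class columns. A short plain Python sketch, with the values copied from the per-class table (the dictionary name is illustrative):

```python
# Per-class (accuracy, IoU) for "Ours", copied from the table.
PER_CLASS = {
    "Sky": (0.94006, 0.89583),
    "Building": (0.64936, 0.62527),
    "Pole": (0.78200, 0.17563),
    "Road": (0.90385, 0.88358),
    "Pavement": (0.89108, 0.642277),
    "Tree": (0.82300, 0.721303),
    "Sign Symbol": (0.80385, 0.22311),
    "Fence": (0.75018, 0.43108),
    "Car": (0.82649, 0.646385),
    "Pedestrian": (0.77986, 0.274434),
    "Bicyclist": (0.72339, 0.502063),
}

mean_accuracy = sum(acc for acc, _ in PER_CLASS.values()) / len(PER_CLASS)
mean_iou = sum(iou for _, iou in PER_CLASS.values()) / len(PER_CLASS)
worst_class = min(PER_CLASS, key=lambda name: PER_CLASS[name][1])
print(round(mean_accuracy, 5), round(mean_iou, 5), worst_class)
```

The averages agree with the reported 0.80665 and 0.54736. The lowest IoU belongs to Pole, a thin object where 1-bit features plausibly lose the most detail.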

Segmentation Result Comparison – SegNet
[Figure: qualitative segmentation comparison. Panels: Image, Ground Truth, SegNet, Ours]

FINN* based BNN Segmentation HW
• Heterogeneous streaming architecture
• Scalable architecture – configurable SIMD/PE
• Developed using High Level Synthesis (HLS)
* Umuroglu, Yaman, et al. "Finn: A framework for fast, scalable binarized neural network inference." Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2017.

Current HW specification
• Target FPGA board: Xilinx ZCU104 (Zynq UltraScale+ XCZU7EV-2FFVC1156 MPSoC with 504K logic cells) → it is the only board we have... (T_T)
• Resource: FF 145479, LUT 321172, BRAM_18K 324
• Performance: 360p (360x480x3) 30 FPS @ 200 MHz
• The longest latency of pipeline stage(conv layer) is 6220805 cycles → 6220805/200000000=0.031 sec
• Performance and resources are scalable (by adjusting SIMD/PE)
Convolution/Transposed convolution logic
Resource and performance analysis of synthesized HW by using Xilinx HLS
Resource utilization of synthesized HW
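The 30 FPS figure follows from the slowest pipeline stage. A minimal sketch of that arithmetic in plain Python, assuming that in a streaming pipeline the steady-state throughput is set by the longest stage; the constants come from the slide and the function names are illustrative:

```python
CLOCK_HZ = 200_000_000            # 200 MHz, from the slide
SLOWEST_STAGE_CYCLES = 6_220_805  # longest pipeline stage (conv layer), from the slide


def stage_latency_s(cycles, clock_hz):
    """Time one frame spends in a stage that takes the given number of cycles."""
    return cycles / clock_hz


def pipeline_fps(slowest_stage_cycles, clock_hz):
    """Steady-state frames per second when the slowest stage sets the pace."""
    return clock_hz / slowest_stage_cycles


latency = stage_latency_s(SLOWEST_STAGE_CYCLES, CLOCK_HZ)
fps = pipeline_fps(SLOWEST_STAGE_CYCLES, CLOCK_HZ)
print(f"{latency:.4f} s per frame, {fps:.1f} FPS")
```

The result is slightly above 30 FPS, leaving a small margin. Raising SIMD/PE on the slowest layer would shorten that stage at the cost of more FPGA resources.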

Comparison with ESPNet*
* Mehta, Sachin, et al. "Espnet: Efficient spatial pyramid of dilated convolutions for semantic segmentation." Proceedings of the European Conference on Computer Vision (ECCV). 2018.

Comparison with ESPNet
                     ESPNet             Ours
Platform             NVIDIA Jetson TX2  Xilinx ZCU104
Dataset              Cityscapes         CamVid
Inference speed      6~9 FPS            30 FPS
Operating frequency  828~1300 MHz       200 MHz
DRAM usage           3.52 GB            Only for storing input images; weights, biases, and activations are stored in on-chip memory

On-device “Light Weight Semantic Segmentation”!!
Small & Low Power
Processor
The video is, honestly, heavily cherry-picked!!!
Neural network acceleration that processes 360p video in real time at 30 fps!

Difficulties and future work
• The network architecture has to be chosen with the hardware in mind, so settling the trade-off is hard
• A BNN must segment using 1-bit features, which makes a high-accuracy network hard to obtain
• 1-bit Drop the bit: extend to GPU compute kernels && accelerators for Arm environments
• Low-bit quantized networks seem to need some research into architectures that differ a little from conventional networks
→ BNN architecture golden rules, hardware-aware design, …
• On the hardware-aware side, SqueezeNext* && MobileNetV3 (MobileNetEdgeTPU)** have recently shown this direction
• For the next project, follow the trend and consider multi-bit quantization for a higher-accuracy network?
• 1-bit GAN
*https://arxiv.org/abs/1803.10615
**https://ai.googleblog.com/2019/11/introducing-next-generation-on-device.html?fbclid=IwAR28AznWOPf-NUj_S1P5ZUWwTTlrtTk56HpZA7XpnSjWfICGZ1mBzfspFqU

Collaboration!
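Collaboration, something I never got to do in graduate school.
The experience of doing research with researchers from other fields matters.
I realized research does not have to be done alone! Why didn't I do this sooner?
From now on I will study together with others even harder!!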
