1-bit Semantic Segmentation
Sharing a side-project development experience!
// Neural Network Quantization & Inference Acceleration
// Project. Drop the bit
Jeounghoon Kim (김정훈), AI Robotics KR
https://www.facebook.com/jeounghoon.kim.5
KakaoTalk: @SamuelKJH
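Speaker: Jeounghoon Kim
. M.S. in Electrical and Electronic Engineering, Korea University
. Control and robotic systems researcher
. Deep learning engineer (technical research personnel)
. Interests: State Estimation & Neural Network
2019 activities:
. Talk at Google Developer Group Gwangju DevFest 2019
. Samsung Electronics seminar on State Estimation with Probabilistic Data Association & Multi-Object Tracking (scheduled for 2019.11.26)
. Advisory committee member, Korea University of Technology and Education Online Lifelong Education Institute
. Reviewer, Samsung Open Source Conference (SOSCON) 2019
. Mathworks Advisory Board 2019
. Study group leader, Neural Network Quantization & Compact Network Design (subscribe.. and like…)
. Organizer, AI Robotics KR
I started community activities to find people to research and talk with!
Today's talk: Neural Network Quantization & Inference Acceleration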
Project. Drop the bit
Weight & Bias
FP32 → 1 Bit
Small Computing Device!
On-device AI!
Neural Network: Heavy & Power Hungry
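Project. Drop the bit
• Goal: realize on-device AI using neural network model compression and hardware acceleration techniques
• Participants: Jeounghoon Kim (김정훈; SW, Korea University), Hyunwoo Kim (김현우; HW, Hanyang University)
• Approach: SW & HW Collaboration
• Deliverable: an on-device AI implementation,
Stage1: Light Weight End to End Semantic Segmentation
Many thanks to HW Kim..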
Processing Environments
Jetson Nano
Raspberry Pi
PC/CPU/GPU
ASIC
FPGA
Computing Environments:
CPU, GPU,
Arm Processors,
Neural Processing Unit(NPU),
Dedicated Hardware,
…
Neural Network Quantization
FP32 → Lower Bit
Weight & Bias
Lower Volume, Computation Power, Memory Access & Usage, …
Model Compression!
&& Acceleration
(provided plenty of additional supporting help is available)
Neural Network Quantization
Wu, Shuang, et al. "Training and inference with integers in deep neural networks." ICLR 2018, arXiv:1802.04680 (2018).
For anyone who needs a conceptual explanation of 8-bit quantization…
https://kr.mathworks.com/company/newsletters/articles/what-is-int8-quantization-and-why-is-it-popular-for-deep-neural-networks.html?s_v1=29204&elqem=2890150_EM_KR_19-11_NEWSLETTER_CG-DIGEST&elqTrackId=a602b18751024fa5a27b1359e4a129a4&elq=f393de06767f4f3ba8bde323d1cf7176&elqaid=29204&elqat=1&elqCampaignId=10302&fbclid=IwAR3MQomYaE8RmG5CICO6ZJ5rP_NXyFfjr18grV82jOA5CLvjXRcbGSC_igE
This talk is strictly about the network's Weight & Bias && Activation
Neural Network Quantization
[Figure: Workflow. Stages shown: Architecture Design; Data Analysis & Pre-processing; Model Training; Network Analysis; t-SNE Spaces; Weight Histogram; Post-Processing]
Neural Network Quantization
These features are actually truly precious… features we had been using without a second thought…
Binarized Neural Networks
Network Parameter Comparison – DeepLabV3+
FP32 vs 1 Bit
What if DeepLabV3+ were quantized to 1 bit?

Architecture Weight Count
         bit            byte          MB
FP32     659,111,936    82,388,992    82.388992
1-bit    20,597,248     2,574,656     2.574656

Maximum Activation (Initial Input Size: 360 x 480 x 3)
         bit            byte          MB
FP32     105,062,400    13,132,800    13.1328
1-bit    3,283,200      410,400       0.4104

Total Weight Count, DeepLabV3+ (Baseline: ResNet18)
Width  Height  Input Channel  Output Channel  Total
7      7       3              64              9408
3      3       64             64              36864
3      3       64             64              36864
3      3       64             64              36864
3      3       64             64              36864
1      1       64             48              3072
1      1       64             128             8192
3      3       64             128             73728
3      3       128            128             147456
3      3       128            128             147456
3      3       128            128             147456
3      3       128            256             294912
1      1       128            256             32768
3      3       256            256             589824
3      3       256            256             589824
3      3       256            256             589824
1      1       256            512             131072
3      3       256            512             1179648
3      3       512            512             2359296
3      3       512            512             2359296
3      3       512            512             2359296
3      3       512            256             1179648
3      3       512            256             1179648
3      3       512            256             1179648
1      1       512            256             131072
1      1       1024           256             262144
8      8       256            256             4194304
3      3       304            256             700416
3      3       256            256             589824
1      1       256            11              2816
8      8       11             11              7744

Unsurprisingly, 32-bit and 1-bit differ by a factor of 32 in volume.
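The totals on this slide can be reproduced with a few lines of Python. The sketch below multiplies out each kernel in the layer table (width × height × input channels × output channels), sums the weights, and converts the count to bytes at 32 bits and at 1 bit per weight. The names `LAYERS`, `total_weights` and `size_bytes` are illustrative and not part of the original project.

```python
# (kernel_w, kernel_h, in_channels, out_channels, repeat), read off the layer table.
# `repeat` only collapses identical consecutive rows of that table.
LAYERS = [
    (7, 7, 3, 64, 1),
    (3, 3, 64, 64, 4),
    (1, 1, 64, 48, 1),
    (1, 1, 64, 128, 1),
    (3, 3, 64, 128, 1),
    (3, 3, 128, 128, 3),
    (3, 3, 128, 256, 1),
    (1, 1, 128, 256, 1),
    (3, 3, 256, 256, 3),
    (1, 1, 256, 512, 1),
    (3, 3, 256, 512, 1),
    (3, 3, 512, 512, 3),
    (3, 3, 512, 256, 3),
    (1, 1, 512, 256, 1),
    (1, 1, 1024, 256, 1),
    (8, 8, 256, 256, 1),
    (3, 3, 304, 256, 1),
    (3, 3, 256, 256, 1),
    (1, 1, 256, 11, 1),
    (8, 8, 11, 11, 1),
]


def total_weights(layers):
    """Sum of kernel_w * kernel_h * in_channels * out_channels over all layers."""
    return sum(w * h * c_in * c_out * n for w, h, c_in, c_out, n in layers)


def size_bytes(count, bits_per_weight):
    """Storage needed for `count` weights at the given bit width."""
    return count * bits_per_weight // 8


weights = total_weights(LAYERS)
fp32_bytes = size_bytes(weights, 32)
one_bit_bytes = size_bytes(weights, 1)
print(weights, fp32_bytes, one_bit_bytes, fp32_bytes // one_bit_bytes)
# -> 20597248 82388992 2574656 32
```

The printed values match the slide: about 82.4 MB of weights at FP32 versus about 2.6 MB at 1 bit.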
Network Parameter Comparison – XnorNet
Rastegari, Mohammad, et al. "Xnor-net: Imagenet classification using binary convolutional neural networks." European Conference on Computer Vision. Springer, Cham, 2016.
Building Binarized Neural Network Trainer!
• Courbariaux, Matthieu, et al. "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1." arXiv preprint arXiv:1602.02830 (2016).
• Rastegari, Mohammad, et al. "Xnor-net: Imagenet classification using binary convolutional neural networks." European Conference on Computer Vision. Springer, Cham, 2016.
• Darabi, Sajad, et al. "BNN+: Improved binary network training." arXiv preprint arXiv:1812.11800 (2018).
• Liu, Zechun, et al. "Bi-real net: Enhancing the performance of 1-bit cnns with improved representational capability and advanced training algorithm." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
• Zhou, Shuchang, et al. "Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients." arXiv preprint arXiv:1606.06160 (2016).
• Jung, Sangil, et al. "Learning to quantize deep networks by optimizing quantization intervals with task loss." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.
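import torch

class SignumActivation(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        # Mean of |input| over dim 1 (the channel dimension), returned alongside the signs
        mean = torch.mean(input.abs(), 1, keepdim=True)
        # sign() maps 0 to 0; adding 0.01 and taking sign() again maps 0 to +1
        output = input.sign().add(0.01).sign()
        return output, mean

    @staticmethod
    def backward(ctx, grad_output, grad_output_mean):  # STE part
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        # Smooth surrogate gradient: scale the incoming gradient by (2 / cosh(input))^2
        grad_input = (2 / torch.cosh(input)) * (2 / torch.cosh(input)) * grad_input
        # Alternative (clipped STE): zero the gradient outside (-1, 1)
        # grad_input[input.ge(1)] = 0   # greater or equal
        # grad_input[input.le(-1)] = 0  # less or equal
        return grad_input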
Code Example (PyTorch)
Straight Through Estimator for Gradient Propagation
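The backward pass in the slide's code does not pass the gradient through unchanged: it scales it by (2 / cosh(x))², while the commented-out lines correspond to the clipped straight-through estimator, which zeroes the gradient where |x| ≥ 1. The following is a minimal scalar sketch in plain Python (no PyTorch) that puts the two choices side by side. The function names are made up for illustration.

```python
import math


def signum(x):
    """Forward pass: binarize to -1/+1, mapping 0 to +1 like sign().add(0.01).sign()."""
    return 1.0 if x >= 0 else -1.0


def grad_clipped_ste(x, grad_out):
    """Clipped STE: pass the gradient through, but zero it where |x| >= 1."""
    return 0.0 if abs(x) >= 1 else grad_out


def grad_sech2(x, grad_out):
    """Smooth surrogate from the slide's code: (2 / cosh(x))**2 * grad_out."""
    return (2.0 / math.cosh(x)) ** 2 * grad_out


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, signum(x), grad_clipped_ste(x, 1.0), round(grad_sech2(x, 1.0), 4))
```

Both variants leave the forward pass as a hard sign; they differ only in how much gradient reaches inputs far from zero. The smooth version decays gradually instead of cutting off at ±1.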
import torch.nn as nn

class BinConv2d(nn.Conv2d):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def forward(self, input):
        # Keep a full-precision (FP32) copy of the weights the first time through
        if not hasattr(self.weight, 'fp'):
            self.weight.fp = self.weight.data.clone()
        # Binarize the FP32 copy to -1/+1 (0 maps to +1) and convolve with the result
        self.weight.data = self.weight.fp.sign().add(0.01).sign()
        out = nn.functional.conv2d(input, self.weight, None, self.stride,
                                   self.padding, self.dilation, self.groups)
        # The bias is not binarized; it is added after the convolution
        if self.bias is not None:
            self.bias.fp = self.bias.data.clone()
            out += self.bias.view(1, -1, 1, 1).expand_as(out)
        return out
Code Example (PyTorch)
FP32 Weight && Quantized Weight → Optimization for Minimum Quantization Error
→ In the end, quantization comes down to how you handle the gradients and the weights
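One concrete form of "minimum quantization error" is the scaling factor from the XNOR-Net paper cited earlier: for real-valued weights W, the approximation α·sign(W) has the smallest squared error when α is the mean of |W|. The sketch below checks this in plain Python on a made-up weight list. Whether the project's trainer uses exactly this scaling is an assumption, and the function names are illustrative.

```python
def binarize(weights):
    """Return the -1/+1 signs and the XNOR-Net style scale alpha = mean(|w|)."""
    signs = [1.0 if w >= 0 else -1.0 for w in weights]
    alpha = sum(abs(w) for w in weights) / len(weights)
    return signs, alpha


def quantization_error(weights, signs, alpha):
    """Squared error between the FP32 weights and alpha * sign(w)."""
    return sum((w - alpha * s) ** 2 for w, s in zip(weights, signs))


weights = [0.8, -0.3, 0.1, -1.2]  # hypothetical FP32 weights
signs, alpha = binarize(weights)
best = quantization_error(weights, signs, alpha)

# Any other scale, including plain +/-1 (alpha = 1.0), gives a larger error.
for other in (alpha - 0.2, alpha + 0.2, 1.0):
    print(round(other, 2), round(quantization_error(weights, signs, other), 4), ">", round(best, 4))
```

The training side then has to keep the FP32 weights around, as `BinConv2d` does with `weight.fp`, so that small gradient updates can accumulate before they flip a sign.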
BNN Performance Comparison
Nurvitadhi, Eriko, et al. "Accelerating binarized neural networks: Comparison of FPGA, CPU, GPU, and ASIC." 2016 International Conference on Field-Programmable Technology (FPT). IEEE, 2016.
Architecture Study – 1. SegNet
SegNet Architecture (Baseline: VGG16)
Unit (Convolution → Batch Normalization → Activation) → Pool
Unit (Convolution → Batch Normalization) → Pool → Activation
A Pool placed after the Signum Activation loses a lot of information, so the Activation is placed after the Pool
Architecture Study – 1. SegNet
        Global Accuracy  Mean Accuracy  Mean IoU     Weighted IoU  Mean BF Score
FP32    0.936334255      0.804027548    0.71979825   0.885348723   0.788193434
1-bit   0.924336215      0.770357729    0.678117835  0.865555414   0.723602791
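Architecture Study – 2. DeepLabV3+
Difficulties & problems
1. Many residual connections → poor memory utilization in hardware
2. Many down-sampling stages → a weak point for BNNs
3. A network structure in which feature precision matters
4. For features where detail matters, as with dilation, binary features do not seem to be sufficient
5. I tried hard to modify and train it, but could not get the accuracy to come out… T_T
Looking for someone who can give architecture advice….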
Segmentation Network Architecture Setting!
Hardware for the network architecture?
Or a network architecture for the hardware?
Trade-off!!
Segmentation Network Architecture Setting!
Unit: Convolution or Transposed Convolution → Batch Normalization → Signum Activation
Since this is our first attempt at hardware, sacrifice the architecture for the hardware! → Let's start with a simple structure first!
Looking for someone who can give architecture advice (2)….
Encoding → Decoding: Unit1, Unit2, Unit3, Unit4, Unit5, Unit6, Unit7, Unit8, Unit9, Unit10, Unit11
Feature map sizes: 360x480x3 → 360x480x64 → 180x240x128 → 90x120x256 → 180x240x128 → 360x480x64 → 360x480x11
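The feature-map sizes in the diagram give a rough idea of how much activation storage the encoder-decoder needs. The sketch below reads the shapes off the slide and computes the largest intermediate feature map at 1 bit and at 32 bits per element. Treating every intermediate activation as exactly one bit is an assumption made for illustration, and `STAGES` and `elements` are hypothetical names.

```python
# Feature-map shapes (height, width, channels) read off the slide's diagram.
STAGES = [
    (360, 480, 3),    # input image
    (360, 480, 64),
    (180, 240, 128),
    (90, 120, 256),
    (180, 240, 128),
    (360, 480, 64),
    (360, 480, 11),   # final output
]


def elements(shape):
    """Number of values in a feature map of the given shape."""
    h, w, c = shape
    return h * w * c


# Largest feature map after the input image.
largest = max(STAGES[1:], key=elements)
bits_1bit = elements(largest)        # assumption: one bit per activation
bits_fp32 = elements(largest) * 32   # the same map stored as FP32
print(largest, bits_1bit // 8, "bytes at 1 bit,", bits_fp32 // 8, "bytes at FP32")
```

Under that assumption the largest map (360x480x64) takes about 1.4 MB at 1 bit, compared with about 44 MB at FP32.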
Segmentation Result Comparison – SegNet
              FP32 SegNet  Ours
Iteration     80k>         60k
Sky           0.896        0.940
Building      0.834        0.649
Pole          0.961        0.782
Road          0.877        0.904
Pavement      0.527        0.891
Tree          0.964        0.823
Sign Symbol   0.622        0.804
Fence         0.5345       0.750
Car           0.321        0.826
Pedestrian    0.933        0.780
Bicyclist     0.365        0.723
mIoU          0.6010       0.5474
BF            0.4684       0.5478

GlobalAccuracy  MeanAccuracy  MeanIoU  WeightedIoU  MeanBFScore
0.82310         0.80665       0.54736  0.74198      0.54778

              Accuracy  IoU       MeanBFScore
Sky           0.94006   0.89583   0.88903
Building      0.64936   0.62527   0.48987
Pole          0.78200   0.17563   0.45098
Road          0.90385   0.88358   0.66327
Pavement      0.89108   0.642277  0.59924
Tree          0.82300   0.721303  0.62532
Sign Symbol   0.80385   0.22311   0.31997
Fence         0.75018   0.43108   0.43870
Car           0.82649   0.646385  0.54331
Pedestrian    0.77986   0.274434  0.44095
Bicyclist     0.72339   0.502063  0.47152
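The MeanAccuracy and MeanIoU figures are plain averages of the per-class columns, which can be checked directly. The short sketch below copies the per-class Accuracy and IoU values from the table and averages them. The dictionary name is illustrative.

```python
# Per-class (Accuracy, IoU) pairs copied from the per-class table.
PER_CLASS = {
    "Sky": (0.94006, 0.89583),
    "Building": (0.64936, 0.62527),
    "Pole": (0.78200, 0.17563),
    "Road": (0.90385, 0.88358),
    "Pavement": (0.89108, 0.642277),
    "Tree": (0.82300, 0.721303),
    "Sign Symbol": (0.80385, 0.22311),
    "Fence": (0.75018, 0.43108),
    "Car": (0.82649, 0.646385),
    "Pedestrian": (0.77986, 0.274434),
    "Bicyclist": (0.72339, 0.502063),
}

# Unweighted mean over the 11 classes.
mean_accuracy = sum(acc for acc, _ in PER_CLASS.values()) / len(PER_CLASS)
mean_iou = sum(iou for _, iou in PER_CLASS.values()) / len(PER_CLASS)
print(round(mean_accuracy, 5), round(mean_iou, 5))
```

The averages reproduce the reported 0.80665 and 0.54736. They also show where the gap comes from: thin or small classes such as Pole, Sign Symbol and Pedestrian have reasonable accuracy but low IoU, which pulls the mean IoU down.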
Segmentation Result Comparison – SegNet
[Figure: qualitative comparison. Panels: Image, Ground Truth, SegNet, Ours]
FINN* based BNN Segmentation HW
• Heterogeneous streaming architecture
• Scalable architecture – configurable SIMD/PE
• Developed using High Level Synthesis (HLS)
* Umuroglu, Yaman, et al. "Finn: A framework for fast, scalable binarized neural network inference." Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2017.
Current HW specification
• Target FPGA board: Xilinx ZCU104 (Zynq UltraScale+ XCZU7EV-2FFVC1156 MPSoC with 504K logic cells) → it is the only board we have.. T_T
• Resource: FF 145479, LUT 321172, BRAM_18K 324
• Performance: 360p (360x480x3) 30 FPS @ 200 MHz
• The longest latency of a pipeline stage (conv layer) is 6220805 cycles → 6220805/200000000 = 0.031 sec
• Performance and resources are scalable (by adjusting SIMD/PE)
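The 30 FPS figure follows from the slowest pipeline stage. Assuming the streaming pipeline accepts a new frame once per slowest-stage interval (an assumption about how the stages overlap), the frame rate is the clock frequency divided by that stage's cycle count. The constants come from the bullets above, and the variable names are illustrative.

```python
CLOCK_HZ = 200_000_000            # 200 MHz operating frequency
SLOWEST_STAGE_CYCLES = 6_220_805  # longest pipeline stage (a conv layer)

# Time the slowest stage needs per frame, and the resulting throughput.
latency_s = SLOWEST_STAGE_CYCLES / CLOCK_HZ
fps = 1.0 / latency_s
print(f"{latency_s:.4f} s per frame, {fps:.2f} FPS")
# -> 0.0311 s per frame, 32.15 FPS
```

That leaves a small margin over the 30 FPS target. Raising SIMD/PE parallelism would reduce the cycle count of the slowest stage at the cost of more FPGA resources.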
Convolution/Transposed convolution logic
Resource and performance analysis of synthesized HW by using Xilinx HLS
Resource utilization of synthesized HW
Comparison with ESPNet*
* Mehta, Sachin, et al. "Espnet: Efficient spatial pyramid of dilated convolutions for semantic segmentation." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
                     ESPNet             Ours
Platform             NVIDIA Jetson TX2  Xilinx ZCU104
Dataset              Cityscapes         CamVid
Inference speed      6~9 FPS            30 FPS
Operating frequency  828~1300 MHz       200 MHz
DRAM usage           3.52 GB            Only for storing input images. Weight / bias / activation are stored in on-chip memory.
On-device “Light Weight Semantic Segmentation”!!
Small & Low Power Processor
To be honest, the video is heavily cherry-picked!!!
Neural network acceleration that processes 360p video in real time at 30 fps!
Difficulties and future work
• The network architecture has to be chosen with the hardware in mind... so it is hard to settle the trade-off
• A BNN has to perform segmentation with 1-bit features, which makes it hard to obtain a high-accuracy network
• 1-bit Drop the bit: extend to GPU compute kernels && accelerators for Arm environments
• Low-bit quantized networks seem to need some research into architectures that differ somewhat from conventional networks
→ BNN Architecture Golden Rule, Hardware-Aware, …
• On the hardware-aware side, SqueezeNext* && MobileNetV3 (MobileNetEdgeTPU)** have recently shown this direction
• For the next project, follow the trend and consider multi-bit quantization for a higher-accuracy network?
• 1-bit GAN
*https://arxiv.org/abs/1803.10615
**https://ai.googleblog.com/2019/11/introducing-next-generation-on-device.html?fbclid=IwAR28AznWOPf-NUj_S1P5ZUWwTTlrtTk56HpZA7XpnSjWfICGZ1mBzfspFqU
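Collaboration!
Collaboration: something I never got to do in graduate school.
The experience of doing research together with researchers from other fields is important
I realized that research is not something you have to do alone! Why didn't I work this way sooner?
From now on I will study even harder, together with others!!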
Thanks again to HW Kim
