PR-270: PP-YOLO: An Effective and Efficient Implementation of Object Detector - Jinwon Lee
This is the 270th paper review of the TensorFlow Korea paper-reading group PR12.
This paper is PP-YOLO: An Effective and Efficient Implementation of Object Detector from Baidu. By applying a variety of techniques to YOLOv3, it catches both rabbits at once: very high accuracy together with very high speed. I took a deeper look at the various tricks used in the paper. If you are interested in techniques used for object detection such as deformable convolution, exponential moving average, DropBlock, IoU-aware prediction, grid sensitivity elimination, Matrix NMS, and CoordConv, please check out the video and the slides!
Paper link: https://arxiv.org/abs/2007.12099
Video link: https://youtu.be/7v34cCE5H4k
For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/auvizsystems/embedded-vision-training/videos/pages/may-2015-embedded-vision-summit
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Nagesh Gupta, CEO and Founder of Auviz Systems, presents the "Trade-offs in Implementing Deep Neural Networks on FPGAs" tutorial at the May 2015 Embedded Vision Summit.
Video and images are a key part of Internet traffic—think of all the data generated by social networking sites such as Facebook and Instagram—and this trend continues to grow. Extracting usable information from video and images is thus a growing requirement in the data center. For example, object and face recognition are valuable for a wide range of uses, from social applications to security applications. Deep neural networks are currently the most popular form of convolutional neural networks (CNN) used in data centers for such applications. 3D convolutions are a core part of CNNs. Nagesh presents alternative implementations of 3D convolutions on FPGAs, and discusses trade-offs among them.
[PR12] Inception and Xception - Jaejun Yoo
Introduction to Inception and Xception
video: https://youtu.be/V0dLhyg5_Dw
Papers:
Going Deeper with Convolutions
Rethinking the Inception Architecture for Computer Vision
Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning
Xception: Deep Learning with Depthwise Separable Convolutions
Convolutional Neural Networks: Popular Architectures - ananth
In this presentation we look at some of the popular architectures, such as ResNet, that have been successfully used for a variety of applications. Starting from AlexNet and VGG, which showed that deep learning architectures can deliver unprecedented accuracies for image classification and localization tasks, we review other recent architectures such as ResNet, GoogLeNet (Inception), and the more recent SENet that have won ImageNet competitions.
[PR12] PR-050: Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting - Taegyun Jeon
PR-050: Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting
Original Slide from http://home.cse.ust.hk/~xshiab/data/valse-20160323.pptx
Youtube: https://youtu.be/3cFfCM4CXws
PR-217: EfficientDet: Scalable and Efficient Object Detection - Jinwon Lee
This is the 217th paper review of the TensorFlow Korea paper-reading group PR12.
This paper is EfficientDet from Google Brain. As a follow-up to EfficientNet, it proposes an object detection method that aims for both accuracy and efficiency. To that end it introduces a weighted bidirectional feature pyramid network (BiFPN) and a compound scaling method for detection similar to that of EfficientNet. Please see the video for details.
Paper link: https://arxiv.org/abs/1911.09070
Video link: https://youtu.be/11jDC8uZL0E
This is the 243rd paper review of the TensorFlow Korea paper-reading group PR12.
This paper is Designing Network Design Spaces from Facebook AI Research, better known as RegNet.
When designing a CNN, are bottleneck layers really good? Do more layers always give higher accuracy? When the width and height of the activation map are halved (stride 2 or pooling), the channel count is doubled; is that really the best choice? Might it be better to have no bottleneck layers at all, is there a magic number of layers for peak accuracy, and when the activations shrink by half, might tripling the channels beat doubling them?
Rather than designing a single good neural network, this paper is about designing a good design space: a space populated by good networks, in which techniques like AutoML can then search. It proposes a human-in-the-loop procedure that starts from a nearly unconstrained design space and progressively narrows it down to a good one. In the video below you can see from which design space RegNet, which outperforms EfficientNet, was born, and which design choices we took for granted turn out to be questionable.
Video link: https://youtu.be/bnbKQRae_u4
Paper link: https://arxiv.org/abs/2003.13678
PR095: Modularity Matters: Learning Invariant Relational Reasoning Tasks - Jinwon Lee
This is the 95th presentation of the TensorFlow-KR paper-reading group.
Titled Modularity Matters, this paper proposes a way to solve visual relational reasoning problems. It shows that existing CNNs are weak at such problems and presents a method to address this. I reviewed it because the topic interests me and because it comes from Prof. Bengio's team.
Presentation video: https://youtu.be/dAGI3mlOmfw
Paper link: https://arxiv.org/abs/1806.06765
Final presentation of my thesis on "A Neurally Controlled Robot That Learns" at Imperial College, 22 Sept 2011.
Full thesis incl. source code available on GitHub:
https://github.com/bwalther/DA-STDP-modulated-learning-in-mobile-robots
PR-144: SqueezeNext: Hardware-Aware Neural Network Design - Jinwon Lee
This is the 144th paper review of the TensorFlow-KR paper-reading group PR12.
This time I reviewed SqueezeNext, one of the representative efficient CNNs. I also reviewed its predecessor SqueezeNet, and since SqueezeNext ranks first under NetScore, a paper on a metric for evaluating CNNs, I reviewed NetScore as well.
Paper links:
SqueezeNext - https://arxiv.org/abs/1803.10615
SqueezeNet - https://arxiv.org/abs/1602.07360
NetScore - https://arxiv.org/abs/1806.05512
Video link: https://youtu.be/WReWeADJ3Pw
PR-231: A Simple Framework for Contrastive Learning of Visual Representations - Jinwon Lee
This is the 231st paper review of the TensorFlow Korea paper-reading group PR12.
This paper is A Simple Framework for Contrastive Learning of Visual Representations from Google Brain. It has drawn extra attention recently, in part because Geoffrey Hinton is the last author.
It is a self-supervised learning paper using contrastive learning, a very hot topic lately, and it proposes an unsupervised pre-training method that matches the accuracy of a ResNet-50 trained with supervised learning. Using data augmentation, a non-linear projection head, large batch sizes, longer training, the NT-Xent loss, and more, it shows that excellent representation learning is possible, with outstanding results in semi-supervised learning and transfer learning as well. Please see the video for details.
Paper link: https://arxiv.org/abs/2002.05709
Video link: https://youtu.be/FWhM3juUM6s
PR-297: Training data-efficient image transformers & distillation through att... - Jinwon Lee
Hello, this is the 297th review of the TensorFlow Korea paper-reading group PR-12.
Only three papers remain until the end of PR-12 season 3.
Recruiting of new members for season 4 will begin right after season 3 ends; your interest and applications are very welcome!
(The recruiting announcement will be posted in the Facebook TensorFlow Korea group.)
The paper I reviewed today is Training data-efficient image transformers & distillation through attention from Facebook.
Since Google's ViT paper, interest in computer vision algorithms that use only attention, without any convolution, is higher than ever.
The DeiT model proposed in this paper uses the same architecture as ViT, but where ViT underperformed when trained on ImageNet data alone,
DeiT achieves better accuracy than EfficientNet using only ImageNet data, thanks to an improved training recipe and a new knowledge distillation method.
Will CNNs really fade away now? Will attention conquer computer vision too?
Personally, I am convinced that attention-based CV papers will pour out for a while, and that surprising things can happen there.
CNNs have advanced through a decade of research, while transformers have only just been applied to CV, so the expectations are even higher,
and since attention is the model form with the least inductive bias, I think it can produce even more surprising results.
OpenAI's recent DALL-E is a representative example. If you are curious about yet another transformation by the Transformer, please see the video below.
Video link: https://youtu.be/DjEvzeiWBTo
Paper link: https://arxiv.org/abs/2012.12877
For real-world applications, a convolutional neural network (CNN) model can take more than 100 MB of space and can be computationally too expensive. Therefore, multiple methods exist in the state of the art to reduce this complexity. Ristretto is a plug-in to the Caffe framework that employs several model approximation methods. For this project, a CNN model is first trained on the CIFAR-10 dataset with Caffe; then Ristretto is used to generate multiple approximated versions of the trained model using different schemes. The goal of this project is a comparison of the models in terms of execution performance, model size, and cache utilization in the test or inference phase. The same steps are done with TensorFlow and its quantization tool. The quantization schemes of TensorFlow and Ristretto are then compared.
Modern Convolutional Neural Network techniques for image segmentation - Gioele Ciaparrone
Recently, Convolutional Neural Networks have been successfully applied to image segmentation tasks. Here we present some of the most recent techniques that increased the accuracy in such tasks. First we describe the Inception architecture and its evolution, which allowed increasing the width and depth of the network without increasing the computational burden. We then show how to adapt classification networks into fully convolutional networks, able to perform pixel-wise classification for segmentation tasks. We finally introduce the hypercolumn technique to further improve the state of the art on various fine-grained localization tasks.
PR-169: EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks - Jinwon Lee
This is the 169th paper review of the TensorFlow-KR paper-reading group PR12.
This paper is EfficientNet from Google. Research on efficient neural networks has usually focused on small networks for edge devices with limited computing power, such as mobile phones. In practice, however, networks are commonly grown larger and larger to raise accuracy, and this paper studies how to scale a network up more efficiently in that setting. Please see the video for details.
Paper link: https://arxiv.org/abs/1905.11946
Video link: https://youtu.be/Vhz0quyvR7I
For the full video of this presentation, please visit:
https://www.embedded-vision.com/platinum-members/embedded-vision-alliance/embedded-vision-training/videos/pages/sep-2019-alliance-vitf-facebook
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Raghuraman Krishnamoorthi, Software Engineer at Facebook, delivers the presentation "Quantizing Deep Networks for Efficient Inference at the Edge" at the Embedded Vision Alliance's September 2019 Vision Industry and Technology Forum. Krishnamoorthi gives an overview of practical deep neural network quantization techniques and tools.
Introduction to computer vision with Convolutional Neural Networks - MarcinJedyk
Introduction to computer vision with convolutional neural networks: going over the history of CNNs, describing basic concepts such as convolution, and discussing applications of computer vision and image recognition technologies.
Once-for-All: Train One Network and Specialize it for Efficient Deployment - taeseon ryu
Hello, this is the Deep Learning Paper Reading Group! The paper introduced today is Once-for-All: Train One Network and Specialize it for Efficient Deployment.
The paper looks at the situation of actually deploying models to hardware. The biggest problem it points out is that there are simply too many hardware environments on which a trained model must be deployed; every device has different resources, so finding a model that fits every piece of hardware is practically impossible.
The usual question is what to do when the optimal network architecture differs for each hardware target. One possible approach is to search for the optimal architecture for each device separately, but that requires an impractical amount of computation. Taking the Samsung Note 10 as an example: if an application requires the model to run within 20 ms, then to find which model fits within 20 ms and what accuracy it achieves, you would have to evaluate all of the blue points in the figure, and each point means one training run. So in effect you must run many trainings and then pick the best among them. As the number of deployment scenarios grows, this cost grows linearly,
so finding the optimal network for each hardware target is practically impossible.
The approach OFA proposes: once you train one network, you no longer need to retrain for each hardware target; you simply take the sub-network that fits each environment. That is the main approach.
Donghyun Kim of the Fundamental team kindly prepared today's detailed review. Thank you in advance for your interest!
NIT Silchar ML Hackathon 2019 session on computer vision with deep learning.
Target audience / prerequisite: basic knowledge of machine learning and deep learning.
Recurrent Neural Networks have been shown to be very powerful models, as they can propagate context over several time steps. Because of this they can be applied effectively to several problems in Natural Language Processing, such as language modelling, tagging problems, speech recognition, etc. In this presentation we introduce the basic RNN model and discuss the vanishing gradient problem. We describe LSTM (Long Short-Term Memory) and Gated Recurrent Units (GRU). We also discuss bidirectional RNNs with an example. RNN architectures can be considered deep learning systems where the number of time steps plays the role of the depth of the network. It is also possible to build the RNN with multiple hidden layers, each having recurrent connections from the previous time steps that represent abstraction both in time and in space.
MXNet image segmentation to predict and diagnose the cardiac diseases karp... - KannanRamasamy25
A powerful open-source deep learning framework.
MXNet supports multiple languages such as C++, Python, R, Julia, Perl, etc.
MXNet is supported by Intel, Dato, Baidu, Microsoft, Wolfram Research, and research institutions such as Carnegie Mellon, MIT, the University of Washington, and the Hong Kong University of Science and Technology.
Symbolic execution: a static symbolic graph executor, which provides efficient symbolic graph execution and optimization.
Supports efficient deployment of a trained model to low-end devices for inference, such as mobile devices, IoT devices (using AWS Greengrass), serverless (using AWS Lambda), or containers.
In this talk, after a brief overview of AI concepts, in particular machine learning (ML) techniques, some of the well-known computer design concepts for high performance and power efficiency are presented. Subsequently, those techniques that have had a promising impact on computing ML algorithms are discussed. Deep learning has emerged as a game changer for many applications in various fields of engineering and the medical sciences. Although the primary computation function is matrix-vector multiplication, many competing efficient implementations of this primary function have been proposed and put into practice. This talk will review and compare some of those techniques that are used for ML computer design.
3. NetAdapt: Platform Aware Neural Network Adaptation for Mobile Applications (Google)
• Typical optimization focuses on reducing indirect metrics such as MACs / FLOPs
• But does that actually optimize direct metrics such as latency and energy consumption? (Not necessarily.) NetAdapt optimizes with these direct metrics in mind
• Empirical measurements
• Contribution
• Automatically and progressively simplifies a pre-trained network until the resource budget is met, while maximizing accuracy
• Achieves better accuracy-versus-latency trade-offs on mobile CPU & GPU, compared with state-of-the-art automated network simplification algorithms
• Method
• Rather than meeting the given constraints in one shot, tighten the constraint iteratively while optimizing accuracy at each step
• Per step: adjust the filter count of the layer that meets the tightened constraint with the smallest accuracy drop
• Slow
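As a rough illustration of this loop, here is a minimal sketch (my own, not the authors' code), assuming a hypothetical per-layer latency look-up table `lut` and a stand-in `eval_acc` for short-term fine-tuning plus evaluation:

```python
# A "model" is a dict {layer_name: n_filters}; lut[layer][n_filters] gives
# the measured latency of that layer configuration.

def latency_from_lut(model, lut):
    """Approximate total latency by summing per-layer LUT entries."""
    return sum(lut[layer][n] for layer, n in model.items())

def netadapt(model, lut, eval_acc, target_latency, step=0.05):
    budget = latency_from_lut(model, lut)
    while budget > target_latency:
        budget *= 1.0 - step                 # tighten the constraint gradually
        candidates = []
        for layer in model:                  # one candidate per layer
            cand = dict(model)
            # shrink this layer until the tightened budget is met
            while cand[layer] > 1 and latency_from_lut(cand, lut) > budget:
                cand[layer] -= 1
            if latency_from_lut(cand, lut) <= budget:
                candidates.append((eval_acc(cand), cand))
        # keep the candidate with the smallest accuracy drop
        model = max(candidates, key=lambda c: c[0])[1]
    return model  # long-term fine-tuning is run once, on this final network
```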
6. Algorithm Details
• Empirical Measurements
• Build a look-up table per layer to save as much measurement time as possible
• Choose which Filter
• Remove filters in increasing order of L2-norm magnitude
• Computing their joint influence and pruning on that basis would be another option*
• Fine-tuning
• Roughly compare candidates with short-term fine-tuning; run long-term fine-tuning only on the final result
• Short-term training: about 40k iterations, w/ ImageNet training set minus a 10,000-image holdout set
*Yang, Tien-Ju and Chen, Yu-Hsin and Sze, Vivienne: Designing energy-efficient convolutional neural networks using energy-aware pruning. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2017)
8. ADC: Automated Deep Compression and Acceleration with Reinforcement Learning (Song Han)
• NetAdapt's competitor
• LPIRC: Google achieved better accuracy than ADC & is practical
• Efficient DL workshop: Song Han: "NetAdapt is slow!"
• Reinforcement-learning-based agent
• Efficient design space exploration
• Accuracy & compression rate
• Sampling the design space greatly improves the model compression quality
• Even better than human expertise!
9. ADC Agent
• w/ continuous compression ratio control (DDPG*)
• Receives a reward based on the approximated model performance, without fine-tuning
• Accuracy & overall compression rate
• Further scenarios: FLOPs-constrained compression & accuracy-guaranteed compression
• Processes a network in a layer-by-layer manner
• Input: layer embedding state $s_t$
• Output: a fine-grained sparsity ratio for each layer
* N. Johnson, S. Kotz, and N. Balakrishnan. Continuous univariate probability distributions, (vol. 1), 1994.
10. Algorithm
• Specified compression algorithms (reducing channels to c') for an n × c × k × k layer:
• Spatial decomposition[1]: n × c' × k × 1, c' × c × 1 × k (data-independent reconstruction)
• Channel decomposition[2]: n × c' × k × k, c' × c × 1 × 1
• Channel pruning[3]: n × c' × k × k (L2-norm / magnitude-based pruning)
• Agent
• Each transition in an episode is $(s_t, a_t, R, s_{t+1})$
• The agent is trained with a reward based on the error[4] after the action is applied
• FLOPs-constrained compression
• R = -Error
• Compress the network once first, then use heuristics to progressively push it under the given budget
• Accuracy-guaranteed compression
• Observe that the accuracy error is inversely proportional to log(FLOPs)
• R = -Error * log(FLOPs)
[1] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.
[2] X. Zhang, J. Zou, K. He, and J. Sun. Accelerating very deep convolutional networks for classification and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10):1943–1955, 2016.
[3] Y. He, X. Zhang, and J. Sun. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1389–1397, 2017.
[4] B. Baker, O. Gupta, N. Naik, and R. Raskar. Designing neural network architectures using reinforcement learning.
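A minimal sketch of the two reward definitions above, as I read them from the slide (helper names are mine):

```python
import math

def reward_flops_constrained(error):
    # the FLOPs budget is enforced by the heuristic allocation, so the
    # reward tracks accuracy only: R = -Error
    return -error

def reward_accuracy_guaranteed(error, flops):
    # error is observed to be inversely proportional to log(FLOPs), so
    # R = -Error * log(FLOPs) trades accuracy against model size
    return -error * math.log(flops)
```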
12. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference (Google, CVPR 2018)
• How to train quantized neural networks?
• Previous quantization approaches:
• Tend to tackle only easy problems (AlexNet, ResNet, VGG), all over-parameterized
• Consider only compression, not computational efficiency
• Look-up-table methods: perform poorly on common devices
• Methods relying on bitwise operations such as shift / XOR gain little on existing hardware
• Fully XOR-based networks suffer from performance degradation
• Quantization scheme
• Weights / activations: 8-bit integers
• Bias vectors: 32-bit integers
• Quantized inference / training framework
• Adopted in TFLite (inference)
• Inference: integer-only arithmetic / training: floating-point arithmetic
13. Quantized Inference
• Affine mapping
• $r = S(q - Z)$
• maps integers $q$ to real numbers $r$; $S$, $Z$ are the quantization parameters
• Uses a single set of quantization parameters for all values within each activations array and within each weights array
• Computation of matrix multiplication
• For $r_3 = r_1 r_2$ (each $r_\alpha$: an N × N matrix):
• $r_3^{(i,k)} = \sum_{j=1}^{N} r_1^{(i,j)} r_2^{(j,k)}$
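To make the arithmetic concrete, here is a small NumPy sketch of the affine scheme and an integer-only matmul with int32 accumulation; the uint8 range and helper names are illustrative assumptions, not TFLite's actual kernels:

```python
import numpy as np

def quantize(r, scale, zero_point):
    # r = S * (q - Z)  =>  q = round(r / S) + Z, stored as uint8
    q = np.round(r / scale) + zero_point
    return np.clip(q, 0, 255).astype(np.uint8)

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.int32) - zero_point)

def quantized_matmul(q1, s1, z1, q2, s2, z2, s3, z3):
    # accumulate sum_j (q1 - Z1)(q2 - Z2) in int32
    acc = (q1.astype(np.int32) - z1) @ (q2.astype(np.int32) - z2)
    # requantize to the output array's scale: q3 = Z3 + (S1*S2/S3) * acc
    q3 = np.round((s1 * s2 / s3) * acc) + z3
    return np.clip(q3, 0, 255).astype(np.uint8)
```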
14. Quantized Inference
• Bias quantization
• Bias quantization error acts as an overall bias
• 32-bit representation
• $Z_{bias} = 0$, $S_{bias} = S_1 \cdot S_2$
• Things left to do
• Scale down to the final scale (8-bit output activations)
• Cast down to uint8
• Apply the activation function to yield the final 8-bit output activation
15. Training with simulated quantization
• All weights & biases are stored in floating point
• Weights are quantized before they are convolved with the input
• Activations are quantized at the points where they would be during inference
• Tuning quantization parameters
• Weights: linear quantization over the [min value, max value] range
• Activations: ranges tracked with exponential moving averages
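A minimal sketch of this simulated ("fake") quantization idea, assuming an 8-bit grid and a hypothetical EMA range tracker:

```python
import numpy as np

def fake_quant(x, x_min, x_max, num_bits=8):
    """Round x to the num_bits grid over [x_min, x_max], staying in float."""
    scale = (x_max - x_min) / (2 ** num_bits - 1)
    q = np.round((np.clip(x, x_min, x_max) - x_min) / scale)
    return q * scale + x_min  # float again, so the next layer runs as usual

class EmaRange:
    """Track an activation range with an exponential moving average."""
    def __init__(self, momentum=0.99):
        self.momentum, self.min, self.max = momentum, None, None

    def update(self, x):
        lo, hi = float(x.min()), float(x.max())
        if self.min is None:
            self.min, self.max = lo, hi
        else:
            m = self.momentum
            self.min = m * self.min + (1 - m) * lo
            self.max = m * self.max + (1 - m) * hi
        return self.min, self.max
```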
18. SBNet: Sparse Blocks Network for Fast Inference (Uber)
• A low-cost computation mask reduces computation in the high-resolution main network
• Tiling-based sparse convolution algorithm
• Implements a tiling-based GPU kernel
• LiDAR 3D object detection tasks
19. Sparse Blocks Network
• How to handle sparse input?
• Mask to indices: extract a list of active location indices
• Sparse gather/scatter: extract data from the sparse inputs
• Signal processing: overlap-save algorithm
• Repeated gathering / scattering while processing
21. Shift: A Zero FLOP, Zero Parameter Alternative to Spatial Convolutions (UC Berkeley, Kurt Keutzer)
• Shift-based module
• Uses the shift operation to mix spatial information across channels
• Let's use a simple shift operation instead of depth-wise convolution!
• A series of memory operations that shifts the channels of the input tensor in certain directions
• Assigns a different shift kernel per channel group: $k^2$ different shift kernels, with each group of $M/k^2$ channels adopting one shift
• Results
• It looks not that efficient
• But it can be adapted to MIDAP easily
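A rough NumPy sketch of the shift operation as described above (my own illustration; the paper's channel-grouping details may differ):

```python
import numpy as np

def shift(x, kernel_size=3):
    """x: feature map of shape (C, H, W); zero FLOPs, zero parameters."""
    c = x.shape[0]
    half = kernel_size // 2
    offsets = [(dy, dx) for dy in range(-half, half + 1)
                        for dx in range(-half, half + 1)]   # k*k offsets
    out = np.zeros_like(x)
    group = int(np.ceil(c / len(offsets)))   # ~C / k^2 channels per shift
    for i, (dy, dx) in enumerate(offsets):
        chs = slice(i * group, min((i + 1) * group, c))
        shifted = np.roll(x[chs], shift=(dy, dx), axis=(1, 2))
        # zero out wrapped-around borders so it is a true shift, not a roll
        if dy > 0: shifted[:, :dy, :] = 0
        elif dy < 0: shifted[:, dy:, :] = 0
        if dx > 0: shifted[:, :, :dx] = 0
        elif dx < 0: shifted[:, :, dx:] = 0
        out[chs] = shifted
    return out
```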
22. Shift based modules
• (Shift-)Conv-Shift-Conv module
• $SC^2$ module / CSC module
• Shift kernel
• Size $D_k$: $D_k^2$ possible shift matrices
• Dilation rate: similar to dilated convolution
• Expansion rate $\varepsilon$: expand the channel size via a 1x1 convolution kernel to gather sufficient information with the shift operation
• Only 1x1 convolutions
• Target
• Mobile / IoT applications
• Memory footprint reduction
24. Squeeze-and-Excitation Networks (Momenta & Oxford)
• 1st place winner of the ILSVRC 2017 classification task
• Suggests the SE block
• Feature recalibration
• Squeeze: global average pooling (H × W → 1 × 1)
• Excitation: adaptive recalibration (captures channel-wise dependencies)
25. Squeeze & Excitation
• Excitation
• Gating mechanism with two fully connected layers
• Acts similarly to an attention module
• Results
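A minimal NumPy sketch of an SE block, assuming toy fully connected weights `w1`, `w2` (hypothetical names):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_block(x, w1, w2):
    """Squeeze-and-Excitation on x of shape (C, H, W).

    w1: (C/r, C) and w2: (C, C/r) are the two FC layers (reduction r)."""
    s = x.mean(axis=(1, 2))                  # squeeze: global average pool
    e = sigmoid(w2 @ np.maximum(w1 @ s, 0))  # excitation: FC-ReLU-FC-sigmoid
    return x * e[:, None, None]              # channel-wise recalibration

# toy usage: C=8, reduction r=2
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))
w1 = rng.standard_normal((4, 8))
w2 = rng.standard_normal((8, 4))
y = se_block(x, w1, w2)
```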
26. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices (Megvii Inc.)
• Simple idea
• State-of-the-art architectures: 1x1 conv + DWConv + 1x1 conv
• Intuitive shuffling: 1x1 group conv + shuffle + DWConv + 1x1 group conv
• For g × n output channels (g: # of groups): reshape to (g, n) → transpose to (n, g) → flatten back to g × n (see the sketch below)
• Good results
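A small NumPy sketch of that reshape-transpose-flatten shuffle:

```python
import numpy as np

def channel_shuffle(x, groups):
    """x: (C, H, W) with C = g * n; interleave channels across groups."""
    c, h, w = x.shape
    n = c // groups
    return x.reshape(groups, n, h, w).transpose(1, 0, 2, 3).reshape(c, h, w)

# toy usage: 6 channels tagged 0..5, 2 groups
x = np.arange(6, dtype=float)[:, None, None] * np.ones((6, 1, 1))
print(channel_shuffle(x, groups=2)[:, 0, 0])  # [0. 3. 1. 4. 2. 5.]
```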
27. CondenseNet: An Efficient DenseNet using Learned Group Convolutions (Cornell Univ.)
• Observation
• 1x1 group convolution usually leads to drastic reductions in accuracy
• Learned group convolution
• Removes superfluous computation in the DenseNet architecture via group convolution
• Automatic input feature grouping during training
28. CondenseNet Training
• Split the filters into G groups of equal size before training
• Random grouping for further condensation
• Condensation criterion
• For each input feature: the averaged absolute value of its weights across all outputs within the group
• Group Lasso
• Group-level sparsity
• Condensation procedure
• Condensation factor C
• C - 1 condensing stages
• Prune 1/C of the filter weights at the end of each stage
• Re-index the layer
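A rough sketch of one condensing stage for a 1x1 layer, assuming weights of shape (out_channels, in_channels) that split evenly into G groups (my simplification of the paper's procedure):

```python
import numpy as np

def group_lasso(w, groups):
    """Group-level sparsity penalty for weights w of shape (out, in)."""
    return sum(np.sqrt((g ** 2).sum(axis=0)).sum()
               for g in np.split(w, groups, axis=0))

def condense_stage(w, groups, condensation_factor):
    """Within each output group, zero out the 1/C fraction of input
    features with the smallest mean |weight| (the condensation criterion)."""
    for g in np.split(w, groups, axis=0):     # views into w: prunes in place
        score = np.abs(g).mean(axis=0)        # importance of each input
        k = len(score) // condensation_factor # prune 1/C of the inputs
        g[:, np.argsort(score)[:k]] = 0.0
    return w
```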
29. Stochastic Downsampling for Cost-Adjustable Inference and Improved Regularization in Convolutional Networks (Nanyang Technological University & Adobe & Nvidia)
• Training the network w/ stochastic downsampling
31. Efficient video object segmentation via Network Modulation (Snap)
• Semi-supervised video segmentation
• A human can easily segment an object in the whole video without knowing its semantic meaning
• Typical scenario
• Given: first frame of a video along with an annotated object mask
• Task: to accurately locate the object in all following frames
• Modulator + segmentation network
• Previous approach: FCN pre-training + fine-tuning the network for each specific video sequence
• The fine-tuning step is inefficient
• Proposed: train the segmentation network only once, and train a modulator that fits the given task
• One-shot fine-tuning (an application of one-shot learning == meta-learning)
• Visual modulator (attention), spatial modulator
33. Mobile Video Object Detection with Temporally-Aware Feature Maps (Georgia Tech, Google)
• Video object detection
• ImageNet VID 2015 dataset
• Single-image object detector + LSTM
• LSTM layers create an interweaved recurrent-convolutional architecture
• Bottleneck-LSTM to reduce computational cost
• 15 FPS on a mobile CPU
• Smaller and faster than DFF (Deep Feature Flow)
• This work does not use optical flow estimation
34. Approach
• SSD + convolutional LSTMs
• MobileNet-SSD, removing the final layer
• Inject convolutional LSTM layers directly into the single-frame detector
• Allows the network to encode both spatial and temporal information
• Feature refinement with LSTMs
• Place a single LSTM after the Conv13 layer
• Stack multiple LSTMs after the Conv13 layer
• Place one LSTM after each feature map
36. Towards High Performance Video Object Detection (USTC, Microsoft Research)
• Recent works
• A motion estimation module is built into the network architecture
• Sparse feature propagation
• Run the expensive feature network only on sparse key frames
• Motion field
• Dense feature aggregation
• Utilize every frame to enhance accuracy
• This paper suggests a unified approach
• Sparsely recursive feature aggregation
• Spatially-adaptive partial feature updating
• Recompute features on non-key frames wherever the propagated features have bad quality
• Temporally-adaptive key frame scheduling
• Dynamic key frame scheduling
38. Low-shot Learning with Imprinted Weights (UCLA)
• How to recognize novel visual categories?
• Given base classes w/ abundant samples for training
• Exposed to previously unseen novel classes with a limited amount of training data for each category
• Directly set the weights for a new category based on an appropriately scaled copy of the embedding-layer activations for that training example
• Mimics a human's ability to accept new visual categories: the learner grows its capability as it encounters more categories and training samples
• A single imprinted weight vector is learned for each novel category
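A minimal sketch of imprinting, assuming unit-norm class templates stored as rows of the classifier weight matrix (helper names are mine):

```python
import numpy as np

def l2_normalize(v, eps=1e-8):
    return v / (np.linalg.norm(v) + eps)

def imprint(weights, embedding):
    """Append one unit-norm row (class template) per novel example.

    weights: (num_classes, D) array whose rows are unit-norm templates."""
    return np.vstack([weights, l2_normalize(embedding)[None, :]])

def classify(weights, embedding):
    # with unit-norm rows, the dot product is the cosine similarity
    return int(np.argmax(weights @ l2_normalize(embedding)))
```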
39. Metric Learning
• Proxy-based embedding training
• Previous work: neighborhood components analysis, which learns a distance metric
• Comparison against all other classes
• Proxy-based training
• Comparison against other negatively correlated proxies
• Trainable proxies
• I cannot understand this concept exactly
• Imprinting
• Remembering the semantic embeddings of low-shot examples as the templates for new classes
41. Memory Matching Networks for One-Shot Image Recognition (USTC, Microsoft)
• Writes the features of a set of labelled images into memory
• Reads from memory when performing inference
• A contextual learner employs the memory slots in a sequential manner to predict the parameters of CNNs for unlabeled images
• MM-Net can output one unified model irrespective of the number of shots and categories
42. One-Shot Image Recognition
• Given an unlabeled image $x$, predict its class $\hat{y}$
• $\hat{y} = \arg\max_{y_n} P(y_n \mid x, S)$, where $P(y_n \mid x, S) = f(x \mid S)^{\top} \cdot g(x_n \mid S)$
• Different embedding functions for the unlabeled image and the support images
• $x_n$: support sample of label $y_n$
• Design a memory module to encode the contextual information within the support set into the memory via a write controller
• Memory: consists of M key-value pairs
• Key: $D_m$-dimensional memory representation
• Value: class label
• Write controller
• Encodes the sequence of N support images into M memory slots
• Aiming to distill the intrinsic characteristics of classes
• Contextual embedding
• For the support set / unlabeled image
• bi-LSTM-based approach
43. Feature Generating Networks for Zero-Shot Learning (Saarland Informatics Campus)
• How to cope with unseen classes? (zero-shot learning task)
• Use a GAN to synthesize features of unseen classes
• Use class-level semantic information
45. Dual Skipping Networks (Fudan Univ, Tencent AI)
• Inspired by neuroscience studies
• Coarse-to-fine object categorization
• Mimicking the behavior of the human brain
• LH (fine grain) & RH (coarse grain)
• Proposes a layer-skipping mechanism
• Learns a gating network to predict which layers to skip
46. Model
• The network has left-right subnets, by reference to the LH and RH
• At first, both branches have roughly the same initialized layers and structures
• Skip-Dense Block
• Dense layer: residual or DenseNet-based block
• Gating network
• Path selection
• Whether or not to skip each convolutional layer, learned from the training data
• Threshold function of the gating network
• Performs as a binary classifier
• Training: acts as a scale value
• Testing: a discrete binary value (0: skip)
• Guide
• The faster coarse subnet can guide the slower fine/local subnet
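A tiny sketch of such a gate: a soft scale during training and a hard 0/1 skip decision at test time (the gating weights `w` are hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate(features, w, training, threshold=0.5):
    """Soft scale in training, hard skip decision (0/1) at test time."""
    g = sigmoid(w @ features)             # gating network output in (0, 1)
    if training:
        return g                          # multiply the block's output by g
    return 0.0 if g < threshold else 1.0  # 0 means: skip this layer
```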
48. Deep Mutual Learning (Dalian University of Technology, China)
• Model distillation
• A powerful large network teaches a small network
• Deep mutual learning
• An ensemble of students learns collaboratively & teaches each other
• Collaborative learning
• Dual learning[1]: two cross-lingual translation models teach each other
• Cooperative learning[2]: recognizing the same set of object categories but with different inputs (e.g., RGB + depth)
• This work: different models, but the same input and task
• No a-priori powerful teacher network is necessary!
[1] D. He, Y. Xia, T. Qin, L. Wang, N. Yu, T. Liu, and W. Ma. Dual learning for machine translation. In NIPS, pages 820-828, 2016.
[2] T. Batra and D. Parikh. Cooperative learning with visual attributes. arXiv:1705.05512, 2017.
49. Deep Mutual Learning
• Uses KL divergence so that each network provides training experience to the other
• $D_{KL}(p_2 \| p_1) = \sum_{i=1}^{N} \sum_{m=1}^{M} p_2^m(x_i) \log \frac{p_2^m(x_i)}{p_1^m(x_i)}$ ($N$: # of samples, $M$: # of classes, $p_n$: output posterior of network $\theta_n$)
• Loss function: $L_{\theta_k} = L_{C_k} + \frac{1}{K-1} \sum_{l=1,\, l \neq k}^{K} D_{KL}(p_l \| p_k)$, and vice versa ($L_{C_k}$: classification loss)
• It can be extended to semi-supervised tasks
• (Label information is not required for the posterior computation)
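A small NumPy sketch of the two-network case, using the slide's KL definition (here averaged over the batch; helper names are mine):

```python
import numpy as np

def kl(p_from, p_to, eps=1e-12):
    """D_KL(p_from || p_to), averaged over the batch."""
    return float(np.mean(np.sum(
        p_from * np.log((p_from + eps) / (p_to + eps)), axis=1)))

def cross_entropy(p, y, eps=1e-12):
    return float(-np.mean(np.log(p[np.arange(len(y)), y] + eps)))

def mutual_losses(p1, p2, y):
    """p1, p2: (N, M) softmax outputs of the two peers; y: integer labels."""
    l1 = cross_entropy(p1, y) + kl(p2, p1)  # network 1 mimics network 2
    l2 = cross_entropy(p2, y) + kl(p1, p2)  # and vice versa
    return l1, l2
```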
52. Interpret Neural Networks by Identifying Critical Data Routing Paths (Tsinghua Univ.)
• Interpretable machine learning algorithm
• Explains, or presents in terms understandable to a human
• Distillation guided routing method
• Discover the critical nodes on the data routing paths for individual input samples
• Scalar control gates
• Decide whether each layer's output channel is critical for the decision
53. Methodology
• Pretrained model + channel-wise control gates
• Control gates are learned to find the optimal routing decision in the network
• A scale value for each channel
• Distillation guided routing
• Perform SGD on the same input for T = 30 iterations
• Most scalar values of the gates should be close to zero
• The output of the new network should be similar to the original network
• $\arg\min_{\Lambda} L\left(f_\theta(x),\, f_\theta(x; \Lambda)\right) + \gamma \sum_k \lambda_k$
• Gradients for the control gates: $\frac{\partial \mathrm{Loss}}{\partial \Lambda} = \frac{\partial L}{\partial \Lambda} + \gamma \cdot \mathrm{sign}(\Lambda)$
• CDRP representation
• $v$ for image $x$ = Concatenate(all $\Lambda$)
• Adversarial sample detection
• CDRP comparison
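A rough sketch of one gate-update step implied by the gradient above, with the task gradient `grad_task` assumed to come from backprop (hypothetical helper):

```python
import numpy as np

def gate_step(gates, grad_task, gamma=0.05, lr=0.1):
    """One SGD step on the control gates (lambdas): minimize
    L(f(x), f(x; gates)) + gamma * sum(gates), keeping gates >= 0."""
    grad = grad_task + gamma * np.sign(gates)   # the slide's gradient rule
    return np.clip(gates - lr * grad, 0.0, None)

# run for T = 30 iterations per input, then read off the near-zero gates
```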
54. Deep Photo Enhancer: Unpaired Learning for Image Enhancement from Photographs with GANs (National Taiwan Univ.)
• Problem
• Given a set of photographs w/ desired characteristics
• Transform an input image into an enhanced image with those characteristics
• MIT-Adobe 5K dataset
• 5K images: original images & several versions of retouched images
• Competitive samples: retouched images from photographer C
55. Network
• Define an enhancement by a set of examples Y
• Input X → U-Net-based generator → output (vs. Y) → discriminator
• Add an attention-based feature in the U-Net
• To capture global features (such as the sky)
• Can use a 2-way GAN for consistency checking
56. A2-RL: Aesthetics Aware Reinforcement Learning for Image Cropping
• Cropping the image to improve aesthetic quality
• AVA dataset*
• Traditional approach: sliding window method
• Time-consuming, fixed aspect ratio
• Weakly supervised aesthetics-aware reinforcement learning
• Train the agent using the actor-critic architecture
• Sequential decision making
* N. Murray, L. Marchesotti, and F. Perronnin. AVA: A large-scale database for aesthetic visual analysis. In CVPR, 2012.
57. RL Agent
• 14 pre-defined actions
• Reward function: aesthetic score
• Output of the pretrained view-finding network (aesthetic ranker), trained with the same dataset
58. Distort-and-Recover: Color Enhancement using Deep Reinforcement Learning (Lunit)
• Distort the original image & use the original image as ground truth for recovering
• Adobe-5K training set, but only utilizes the retouched images
• Training a reinforcement learning agent for color enhancement
• Compare the features & take an action
• Reduce the gap between the two images
59. Neural Style Transfer via Meta Networks (Peking Univ., National University of Singapore)
• Generates a network specialized for a specific style
• through one feed-forward pass in the meta network for neural style transfer
• No need for enormous training iterations to adopt a new style
• A small neural style transfer network is generated
60. Embodied Question Answering (Georgia Institute of Technology, Facebook AI)
• New AI task
• 3D environment
• Question → navigate to find the answer → answer
61. Excluded papers
• NestedNet: Learning Nested Sparse Structures in Deep Neural Networks (SNU)
• Real-Time Monocular Depth Estimation using Synthetic Data with Domain Adaptation via Image Style Transfer (Durham Univ.)
• Low-Latency Video Semantic Segmentation (CAS)
• Guided Proofreading of Automatic Segmentations for Connectomics (Harvard)
• Generative Adversarial Learning Towards Fast Weakly Supervised Detection (Xiamen Univ, Microsoft)
• Logo Synthesis and Manipulation with Clustered Generative Adversarial Networks (ETH Zurich)
• Neural Baby Talk (Georgia Institute of Technology, Facebook AI)
• Self-Supervised Feature Learning by Learning to Spot Artifacts (University of Bern)
• CleanNet: Transfer Learning for Scalable Image Classifier Training with Label Noise (Microsoft AI)
Editor's Notes
Empirical measurements: build a look-up table per layer to save as much time as possible.
Input image resolution does not seem to be included in this overall process. (The process is run separately for each resolution.)
Which filter? for k from 1 to K (also shown in the figure on the right)
Idea: one could combine empirical experiments with a scheduling-problem formulation to design a network-tuning algorithm.
Even if you skip compressing the early part at first, the later part ends up compressed more, so it is a loss.
Plain-20, VGG16 4x, MobileNet
Operations such as residual / concat are hard to support.
Similar to our idea.
Channel contribution: computed based on the 1x1 weight values multiplied with that channel's activations.