This is an intensive meetup at Samsung Next IL covering the most interesting papers presented at CVPR 2017 last month. It is a good opportunity to get an overview of recent advances in deep learning applied to computer vision.
The following topics are covered:
• Object detection
• Pose estimation
• Efficient networks
2. About the speaker
Assaf Mushinsky
● Co-founder & Chief Scientist
● Breakthrough MSc research with Prof. Lior Wolf
● Computer Vision & Deep Learning expert
● Key technical roles at Samsung & Eyesight
3. Brodmann17
● Founded in 2016
● Raised $2M
● Team: 10 people, 6 of them researchers (PhDs, MScs)
● Core Technology: Deep Learning for Edge Devices
5. CVPR 2017
● Huge conference!
○ 4950 registrations
○ 783 accepted papers (out of 2620 valid submissions)
○ 215 orals
○ Also…
■ Tutorials!
■ Workshops!
● Papers available online.
○ Many of the papers were already published months ago.
● Videos available on the YouTube channel:
○ https://www.youtube.com/channel/UC0n76gicaarsN_Y9YShWwhw
8. Agenda
● What are we going to talk about today?
○ Object detection and segmentation: people, faces and pose estimation.
○ New and exciting network architectures
○ Efficient deep learning: optimization and cascades
○ Multiscale information
○ Data augmentation, generation and synthesis
○ A “Swiss army knife” network.
● What are we not going to talk about today?
○ Faster R-CNN: https://arxiv.org/abs/1506.01497
○ ResNet: https://arxiv.org/abs/1512.03385
10. Speed/accuracy trade-offs for modern convolutional object
detectors (Google)
● Presents a fair comparison of the leading object detection methods in terms of speed and accuracy.
● A single codebase and training framework for a fair comparison.
● Same hardware: Nvidia GeForce GTX Titan X GPU card.
● Multiple hyper-parameter configurations.
● Multiple network architectures.
● Paper: https://arxiv.org/abs/1611.10012
● Code: https://github.com/tensorflow/models/tree/master/object_detection
13. YOLO9000: Better, Faster, Stronger - Joseph Redmon
● Fast and accurate object detection.
● They improve on their first version of YOLO
by making it better, faster and stronger.
○ Better: YOLOv1 was fast but not very accurate; this version is more accurate.
○ Faster: New network for faster run time.
○ Stronger: Learn to detect 9000 object
classes.
● Paper: https://arxiv.org/abs/1612.08242
● Code: http://pjreddie.com/yolo9000/
● Won best paper honorable mention award.
14. YOLO9000: Better, Faster, Stronger - Joseph Redmon
● Batch normalization (+2%)
● High resolution classifier (+4%)
○ Train ImageNet classifier at 224x224
○ Fine-tune ImageNet classifier at 448x448
○ Train detector at 448x448
● Convolutional with anchor boxes
○ YOLOv1 used FC for prediction.
○ Remove FC and predict anchors (-0.3%)
■ But… recall increases 81%→88%
● Select anchors using k-means
● Direct location prediction (+5%)
○ Sigmoid for constrained bounding-box prediction, instead of the unconstrained prediction used in the RPN (see the sketch below).
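As a rough illustration of the constrained prediction, here is a minimal numpy sketch of the YOLOv2-style box decoding, where the sigmoid keeps the predicted center inside its grid cell (function and variable names are illustrative, not the authors' code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode one predicted box (YOLOv2-style constrained prediction).

    (cx, cy) is the top-left corner of the grid cell, (pw, ph) the
    anchor (prior) size; the sigmoid keeps the center inside the cell.
    """
    bx = cx + sigmoid(tx)   # center x, bounded to the cell
    by = cy + sigmoid(ty)   # center y, bounded to the cell
    bw = pw * np.exp(tw)    # width scales the anchor prior
    bh = ph * np.exp(th)    # height scales the anchor prior
    return bx, by, bw, bh
```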
15. YOLO9000: Better, Faster, Stronger - Joseph Redmon
● Fine-grained features (+1%)
○ How to combine the final features with higher-resolution features from an earlier layer?
○ Passthrough layer
■ Take the previous 26x26x512 feature map and stack adjacent spatial features into channels, getting 13x13x2048 (see the sketch after this slide).
■ Concatenate with the original features.
● Multi-scale training (+1%)
○ Every 10 batches randomly choose a new
image dimension size.
○ Forces the network to learn to predict well
across a variety of input dimensions.
● High resolution detector (+2%)
○ Use 544x544 instead of 416x416
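A minimal numpy sketch of the passthrough idea (a space-to-depth rearrangement), assuming channel-last feature maps; names are illustrative:

```python
import numpy as np

def space_to_depth(x, block=2):
    """Stack 2x2 spatial neighbours into channels (passthrough layer).

    x: feature map of shape (H, W, C); returns (H/2, W/2, C*4),
    e.g. 26x26x512 -> 13x13x2048, ready to concatenate with the
    13x13 features of the final layer.
    """
    h, w, c = x.shape
    x = x.reshape(h // block, block, w // block, block, c)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(h // block, w // block, block * block * c)

coarse = np.zeros((13, 13, 1024), dtype=np.float32)
fine = np.zeros((26, 26, 512), dtype=np.float32)
combined = np.concatenate([coarse, space_to_depth(fine)], axis=-1)  # 13x13x3072
```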
19. YOLO9000: Better, Faster, Stronger - Joseph Redmon
● We want detection to be accurate but we
also want it to be fast.
● Due to multiscale training, detectors can be
applied at different scales for
speed/accuracy trade off.
● Use Darknet-19 instead of VGG16.
○ Mostly 3x3 convolutions.
○ Like NIN: Use 1x1 filters to compress the
feature representation between 3x3 convs.
20. YOLO9000: Better, Faster, Stronger - Joseph Redmon
● How to learn detection for 9000 classes?
● During training mix images from both detection and
classification datasets.
○ For detection images, use full backprop.
○ For classification images, backpropagate only the classification part of the loss.
● Hierarchical classification
○ ImageNet labels are pulled from WordNet.
○ Simplify the problem by building a hierarchical tree
from the concepts in ImageNet.
○ Perform classification using conditional probabilities.
● This formulation also works for detection
○ Instead of assuming that every anchor contains an object, they use an objectness predictor.
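A small sketch of the hierarchical (conditional-probability) classification: the absolute probability of a concept is the product of the conditional probabilities along its path to the root. Data structures and names here are assumptions for illustration:

```python
def absolute_probability(node, cond_prob, parent):
    """Probability of a WordNet-style concept as the product of
    conditional probabilities along the path to the root.

    cond_prob[n] = P(n | parent of n), parent[n] = parent node or None.
    The root's conditional probability can be taken as 1
    (or as the objectness score when used for detection).
    """
    p = 1.0
    while node is not None:
        p *= cond_prob[node]
        node = parent[node]
    return p

# e.g. P("Norfolk terrier") =
#   P(Norfolk terrier | terrier) * P(terrier | dog) * P(dog | ...) * ...
```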
24. RON: Reverse Connection with Objectness Prior Networks for
Object Detection
● Same idea as “Feature Pyramid Networks for Object Detection”
● https://arxiv.org/abs/1707.01691
25. Accurate Single Stage Detector Using Recurrent Rolling
Convolution
● Same idea as “Feature Pyramid Networks for Object Detection”
● https://arxiv.org/abs/1704.05776
26. Object Detection Circa 2007
Source: Ross Girshick’s object detection tutorial in CVPR 2017 http://deeplearning.csail.mit.edu/instance_ross.pptx
27. Object Detection Today
Source: Ross Girshick’s object detection tutorial in CVPR 2017 http://deeplearning.csail.mit.edu/instance_ross.pptx
28. Mask R-CNN - Kaiming He, Ross Girshick (FAIR)
● Instance segmentation with pose estimation
for people.
● Extends faster R-CNN by adding new branch
for the instance mask task.
● Pose estimation can be added by simply
adding an additional branch.
● SOTA accuracy on detection, segmentation
and pose estimation at 5 FPS on GPU.
● https://arxiv.org/abs/1703.06870
● Girshick won young researcher award.
32. Mask R-CNN - Kaiming He, Ross Girshick (FAIR)
● RoiPool
○ Quantization breaks pixel-to-pixel alignment.
○ Too coarse for the fine spatial information required by the mask.
● RoiAlign
○ Bilinearly sample the proposal region and
avoid the quantization.
○ Smoothly normalizes features and predictions into a coordinate frame free of scale and aspect ratio.
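A minimal sketch of the bilinear sampling used by RoIAlign at one real-valued location (channel-last layout assumed, no boundary handling; RoIAlign averages a few such samples per RoI bin instead of snapping coordinates to the integer grid as RoIPool does):

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Sample a feature map feat of shape (H, W, C) at a real-valued (y, x)."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = y0 + 1, x0 + 1
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * feat[y0, x0] +
            (1 - wy) * wx       * feat[y0, x1] +
            wy       * (1 - wx) * feat[y1, x0] +
            wy       * wx       * feat[y1, x1])
```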
34. Mask R-CNN - Kaiming He, Ross Girshick (FAIR)
● Backbone architecture
○ ResNet
○ ResNeXt
○ FPN
● Mask representation
○ FC vs. Convolutional
○ Multinomial vs. Independent Masks:
softmax vs. sigmoid
○ Class-Specific vs. Class-Agnostic Masks:
almost same accuracy
● Multi-task learning
○ Mask task improves object detection
accuracy.
○ Keypoint task reduces object detection
accuracy.
35. Mask R-CNN - Kaiming He, Ross Girshick (FAIR)
● Pose estimation
○ Simply add an additional branch.
○ Model a keypoint’s location as a one-hot mask, and
adopt Mask R-CNN to predict K masks.
○ Experiments are mainly to demonstrate the
generality of the Mask R-CNN framework.
○ RoiAlign improves this task’s accuracy as well.
36. Learning non-maximum suppression
● Object detectors are mostly trained
end-to-end, except for the NMS.
○ NMS is still fully hand-crafted, and forces a
trade-off between recall and precision.
● Training loss is not evaluation loss.
○ Training is performed without NMS
○ During evaluation, multiple detections of the same object count as false positives.
● https://arxiv.org/abs/1705.02950
37. Learning non-maximum suppression
● Additional blocks that:
○ Encode pairwise information.
○ For each detection, pool information from all pairings.
○ Update feature vector.
○ Repeat.
● New loss:
○ Only one positive candidate per object.
○ Instead of the current practice of taking all detections with IoU > 50%.
39. Focal Loss for Dense Object Detection (FAIR)
● Two stage detectors are usually the most
accurate.
● Single stage detectors are simpler and
usually faster.
● Reshaping the cross-entropy loss to down-weight well-classified samples can improve the accuracy of single-stage detectors.
● This approach is shown to be better than online hard example mining.
● Architecture is based on FPN.
● https://arxiv.org/abs/1708.02002
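A minimal sketch of the focal loss for a single binary prediction, following the formula FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t) from the paper:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one example.

    p: predicted probability of the positive class, y: 0/1 label.
    gamma down-weights well-classified examples; alpha balances classes.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)
```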
40. Scale aware face detection
● Detection of small objects is
computationally expensive.
● But what if there are no small objects in an
image? Why should we waste computation
on scanning those scales?
● We can divide face detection into two tasks
○ Estimate the scale of faces in a given
image.
○ For each predicted scale, resize the image to a fixed size and apply detection.
● https://arxiv.org/abs/1706.09876
42. Realtime Multi-Person 2D Pose Estimation Using Part Affinity
Fields
● Multi-person pose estimation is difficult
○ Unknown number of people
○ Interactions between people make the association of parts difficult.
○ Runtime complexity tends to grow with the
number of people in the image.
● The proposed architecture is designed to
jointly learn part locations and their
association.
● Paper: https://arxiv.org/abs/1611.08050
● Code:
https://github.com/ZheC/Realtime_Multi-P
erson_Pose_Estimation
45. Realtime Multi-Person 2D Pose Estimation Using Part Affinity
Fields
● Two branches:
○ Part location confidence maps.
○ Part affinity fields.
● Multi-stage
○ Every stage gets the output of the previous stage as well as the input image features.
○ Output is refined over the different stages,
allowing resolution of conflicts.
● Multi-Person Parsing using PAFs
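A rough sketch of how a part affinity field can score the association between two candidate parts: sample the field along the segment connecting them and average its dot product with the unit direction (a simplification of the paper's line integral; names and sampling scheme are illustrative):

```python
import numpy as np

def paf_score(paf, p1, p2, num_samples=10):
    """Association score between two candidate parts.

    paf: (H, W, 2) part affinity field for this limb type;
    p1, p2: (x, y) candidate locations.
    """
    v = np.array(p2, dtype=np.float32) - np.array(p1, dtype=np.float32)
    u = v / (np.linalg.norm(v) + 1e-8)          # unit direction p1 -> p2
    score = 0.0
    for t in np.linspace(0.0, 1.0, num_samples):
        x, y = np.round(np.array(p1) + t * v).astype(int)
        score += np.dot(paf[y, x], u)           # agreement with the field
    return score / num_samples
```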
46. Towards Accurate Multi-person Pose Estimation in the Wild
(Google)
● Two stage cascade model:
○ Apply a Faster-RCNN person detector to
produce a bounding box around each
candidate person instance.
○ Apply a pose estimator to the image crop
extracted around each candidate person
instance in order to localize its keypoints
and re-score the corresponding proposal.
● https://arxiv.org/abs/1701.01779
● They have a newer version that works without an object detector and is very similar to the part affinity fields method.
● A demo of the newer version was presented at the conference.
48. Coarse-To-Fine Volumetric Prediction for Single-Image 3D
Human Pose
● Common approaches have drawbacks:
○ Estimating 3D pose by regression of (x,y,z)
○ 2D pose map and 3D refinement
● Solution:
○ 3D pose map estimation.
● https://arxiv.org/abs/1611.07828
49. ArtTrack: Articulated Multi-Person Tracking in the Wild
● How to use temporal information for
multi-person pose tracking?
○ Build a spatio-temporal graph with edges connecting different parts within the same frame and the same part across frames.
● Paper: https://arxiv.org/abs/1612.01465
● Dataset: http://www.posetrack.net
● Code:
https://github.com/eldar/pose-tensorflow
50. Let’s take a break
When we get back:
Award winning architectures
Efficient neural networks
A single network that does everything
52. Densely Connected Convolutional Networks
● Residual connections in ResNet allowed
networks to be substantially deeper, more
accurate, and efficient to train.
● Dense connections take this idea further by connecting every pair of layers within a block via channel-wise concatenation.
● Paper: https://arxiv.org/abs/1608.06993
● Code: https://github.com/liuzhuang13/DenseNet
● Memory efficient implementation:
https://arxiv.org/abs/1707.06990
● Won best paper award
53. Densely Connected Convolutional Networks
● Residual connections
● Dense connections
● Transition layers
○ The dense connectivity can't be applied when the feature-map size changes.
○ This is why convolution and pooling layers are
added between dense blocks.
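A minimal PyTorch sketch of a dense block, assuming the standard BN-ReLU-Conv ordering; this illustrates the connectivity pattern, not the reference implementation:

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Every layer sees the concatenation of all preceding feature maps
    and adds k (growth rate) new channels."""
    def __init__(self, in_channels, growth_rate=12, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            channels = in_channels + i * growth_rate
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, growth_rate, kernel_size=3, padding=1),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # dense connectivity
            features.append(out)
        return torch.cat(features, dim=1)

# A transition layer (1x1 conv + 2x2 average pooling) would follow the block
# to reduce channels and spatial size before the next dense block.
```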
57. Densely Connected Convolutional Networks
● Growth rate
○ Every layer produces k output feature maps.
○ The input to the l-th layer is thus k0 + k×(l−1) feature maps, where k0 is the number of channels at the block's input.
○ To prevent the network from growing too wide and
to improve the parameter efficiency k has to be
limited to a small integer.
○ Experiments show k=12 is sufficient to obtain
state-of-the-art results.
● Bottleneck layers
○ Even with a small growth rate, the number of
inputs for some layers can get very large.
○ 1×1 convolution is used to reduce the number of
features to 4k.
● Compression
○ To further improve model compactness, the transition layer transforms its m input feature maps into m/2 (compression factor θ = 0.5).
59. Densely Connected Convolutional Networks
● Stronger gradient flow.
● Parameter and computational efficiency.
● Diversified features due to concatenation of all
previous features.
● Maintains both high & low complexity features.
● Less prone to overfitting than ResNet when large amounts of data aren't available; works better even without augmentation.
60. Multi-Scale Dense Convolutional Networks
for Efficient Prediction
● Multi-scale networks with multiple classifiers.
● Multiple classifiers allow for cascaded
computation.
● Paper: https://arxiv.org/abs/1703.09844
61. Dual Path Networks
● ResNet enables feature re-usage.
● DenseNet enables exploration of new features.
● SOTA accuracy on ImageNet.
● https://arxiv.org/abs/1707.01629
64. Deep Roots: Improving CNN Efficiency with Hierarchical Filter
Groups
● Filter group:
○ In normal convolutional layers, all the filters process all input features.
○ Instead, break the filters and input features into
groups.
● Hierarchical filter groups:
○ Start with a large number of groups and reduce them as the model gets deeper.
● Didn’t compare to non-hierarchical filter groups.
● Reduces:
○ Model size
○ Running time
○ Memory consumption
● Can even improve accuracy
● https://arxiv.org/abs/1605.06489
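A small PyTorch illustration of the difference between a standard convolution and a grouped convolution (layer sizes are arbitrary):

```python
import torch.nn as nn

# Standard conv: each of the 256 filters sees all 256 input channels.
full = nn.Conv2d(256, 256, kernel_size=3, padding=1)

# Grouped conv with 8 groups: each group of 32 filters only sees its own
# 32 input channels, cutting parameters and multiply-adds by roughly 8x.
grouped = nn.Conv2d(256, 256, kernel_size=3, padding=1, groups=8)
```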
66. Xception: Deep Learning With Depthwise Separable
Convolutions
● Replaces Inception modules with depthwise separable convolutions (an extreme form of grouped convolution).
● Regular convolutions and depthwise separable convolutions lie at the two extremes of a discrete spectrum, with Inception modules being an intermediate point in between.
● Slightly outperforms Inception V3
on the ImageNet dataset
● Significantly outperforms
Inception V3 on a larger
classification dataset with 350
million images and 17,000 classes
● https://arxiv.org/abs/1610.02357
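A small PyTorch sketch of a depthwise separable convolution: the extreme case of grouping (one group per channel) followed by a 1x1 pointwise convolution that mixes channels (channel counts are arbitrary):

```python
import torch.nn as nn

depthwise = nn.Conv2d(256, 256, kernel_size=3, padding=1, groups=256)  # per-channel 3x3
pointwise = nn.Conv2d(256, 256, kernel_size=1)                         # mixes channels
separable = nn.Sequential(depthwise, pointwise)
```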
67. Aggregated Residual Transformations for Deep Neural
Networks (ResNeXt)
● ResNet + Inception = ResNeXt
● 2nd place ILSVRC 2016
● Paper: https://arxiv.org/abs/1611.05431
● Code: https://github.com/facebookresearch/ResNeXt
69. ShuffleNet: An Extremely Efficient Convolutional
Neural Network for Mobile Devices
● Minimizes the damage caused by filter groups by shuffling channels between groups.
● https://arxiv.org/abs/1707.01083
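A minimal PyTorch sketch of the channel shuffle operation (the standard reshape-transpose-reshape trick; not the authors' code):

```python
import torch

def channel_shuffle(x, groups):
    """Interleave channels across groups so that the next grouped
    convolution sees information from every group."""
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)
```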
70. Feedback Networks
● Iterative processing of the input
● Improves on the previous iteration using the previous features and the input.
● https://arxiv.org/abs/1612.09508
71. Dilated Residual Networks
● Classification networks gradually reduce the size of the activations until we are left with a single feature vector.
● Classification is usually a proxy task used to pretrain networks before they are transferred to other applications.
● We lose the spatial information that might
be beneficial to tasks such as localization
or segmentation.
● https://arxiv.org/abs/1705.09914
72. Dilated Residual Networks
● We can remove the pooling layers and
avoid the dimension reduction.
● But! Removing the pooling layers will
reduce the network’s receptive field and
hurt accuracy.
● How can we avoid spatial information loss
and still have a large receptive field?
73. Dilated Residual Networks
● Dilated convolutions:
○ Sparse filter: same receptive-field growth as a strided filter.
○ Doesn't skip any input data.
○ Doesn't change the spatial size of the data.
● Advantages:
○ Increases the receptive field.
○ Preserves spatial information.
○ Doesn't increase the number of network parameters.
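A one-line PyTorch illustration: a 3x3 convolution with dilation 2 covers a 5x5 neighbourhood with the same 9 weights and, with padding 2, keeps the spatial size unchanged:

```python
import torch.nn as nn

# Same 9 parameters as a regular 3x3 conv, but a 5x5 receptive field;
# padding=2 preserves the spatial resolution (no stride, no pooling).
dilated = nn.Conv2d(256, 256, kernel_size=3, dilation=2, padding=2)
```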
74. Dilated Residual Networks
● ResNet to Dilated Residual Network (DRN)
○ Remove stride, compensate with dilation
for groups 4 and 5.
○ No need to apply this to groups 1, 2 and 3, because stride 8 is known to preserve most of the information.
○ Original output size was 7×7, new output
size is 28×28.
○ Improves recognition of small objects.
75. Dilated Residual Networks
● DRN-B-26
○ Replaces early pooling with residual blocks.
○ Adds residual blocks with reduced dilation
at the end of the network.
● DRN-C-26
○ Removes residual connections from some
of the added blocks.
○ The layers added in DRN-B-26 didn't remove gridding artifacts, because their residual connections propagated the artifacts.
76. Dilated Residual Networks
● ImageNet Classification
○ DRN-A outperforms deeper ResNets with
same number of layers and parameters.
○ Each DRN-C significantly outperforms the
corresponding DRN-A, showing degridding
is beneficial.
77. Dilated Residual Networks
● ImageNet weakly-supervised localization
○ Lower is better.
○ DRN-C-26 outperforms DRN-A-50 despite its lower depth and lower classification accuracy.
○ DRN-C-26 also outperforms ResNet-101.
80. Not All Pixels Are Equal: Difficulty-aware Semantic
Segmentation via Deep Layer Cascade
● A deep layer cascade method that improves the accuracy and speed of semantic segmentation.
● The model is initially trained as a multi-loss model.
● A second training stage jointly fine-tunes the model as a cascade.
● Runs at ~15 FPS.
● https://arxiv.org/abs/1704.01344
81. Mimicking Very Efficient Network for Object Detection
● Train a small network to mimic the output of a larger one.
○ The large network acts as supervision for
training the smaller network.
○ The small network is trained using L2 loss to
mimic the output of the larger one.
○ Can be expanded to two-stage mimicking for
training efficient Faster R-CNN / R-FCN.
● Experiments
○ R-FCN w/ Inception on Caltech: 7.15
○ R-FCN w/ Inception/2 on Caltech: 8.88
○ R-FCN w/ Inception/2 mimic on Caltech: 7.31
● http://openaccess.thecvf.com/content_cvpr_
2017/papers/Li_Mimicking_Very_Efficient_CV
PR_2017_paper.pdf
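A minimal PyTorch sketch of the mimicking objective: an L2 loss between the small network's (adapted) feature map and the large network's. The 1x1 adaptation convolution is an assumption used here to match channel counts, not necessarily the authors' exact setup:

```python
import torch
import torch.nn.functional as F

def mimic_loss(student_feat, teacher_feat, adapt_conv):
    """L2 loss pushing the small network's features towards the large
    network's; the teacher is fixed (acts only as supervision)."""
    return F.mse_loss(adapt_conv(student_feat), teacher_feat.detach())
```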
82. Spatially Adaptive Computation Time for Residual Networks
● Automatically learns for which pixels to compute the residual functions and for which to simply keep the current value.
● Each layer outputs a halting confidence; once the accumulated confidence passes a threshold, computation stops for that pixel.
● Paper: https://arxiv.org/abs/1612.02297
● Code: https://github.com/mfigurnov/sact
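A very simplified numpy sketch of the halting idea as described on this slide (blocks and halting functions are hypothetical callables; the actual method is more involved):

```python
import numpy as np

def spatially_adaptive_residual(x, blocks, halting_fns, threshold=0.99):
    """x: (H, W, C) feature map. Each block also predicts a per-pixel
    halting score; once a pixel's accumulated score passes the threshold,
    later blocks no longer update that pixel."""
    cumulative = np.zeros(x.shape[:2], dtype=np.float32)
    for block, halting in zip(blocks, halting_fns):
        active = cumulative < threshold              # pixels still being computed
        residual = block(x)                          # (H, W, C) residual update
        x = np.where(active[..., None], x + residual, x)
        cumulative = cumulative + halting(x) * active
    return x
```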
83. LCNN: Lookup-based Convolutional Neural Network (XNOR.AI)
● Builds a dictionary for convolutions.
● Convolutions are computed as weighted combinations of dictionary entries.
● https://arxiv.org/abs/1611.06473
84. Binarized Neural Network with Separable Filters
● They build on Hubara et al.'s work on binarized NNs.
● Breaking 3x3 filters into 1x3 and 3x1 filters.
● 30% faster, minor drop in accuracy.
● https://arxiv.org/abs/1707.04693
86. Learning From Simulated and Unsupervised Images Through
Adversarial Training (Apple)
● Real train data is expensive. Can we use
simulated data?
○ Simulated data is cheap and we don’t need
to annotate it.
○ There is a gap between simulated and real images.
● How can we make synthetic images look
more real?
● How can we do that without changing the
properties of the synthetic images?
● They use this method for eye gaze
estimation and hand pose estimation.
● Paper: https://arxiv.org/abs/1612.07828
● Won best paper award.
87. Learning From Simulated and Unsupervised Images Through
Adversarial Training (Apple)
● They train GAN to modify the synthetic
image to look more real.
○ The generator modifies the image to fool
the discriminator.
○ The discriminator tries to classify real vs.
synthetic images.
● The refiner makes only small local changes, because it is a ResNet with a small receptive field.
● The discriminator's loss is local: it outputs a loss map instead of a single scalar loss.
● Humans achieved 80% accuracy distinguishing synthetic from real images, but only 51% on refined vs. real images.
88. A-Fast-RCNN: Hard Positive Generation via Adversary for
Object Detection
● Adversarial network that generates examples with occlusions and
deformations.
● https://arxiv.org/abs/1704.03414
89. Training Object Class Detectors With Click Supervision
● 9× faster labeling than full supervision.
● No comparison to the state of the art in terms of accuracy.
● Two-click validation helps determine the scale of the object.
● Starts with an annotator verification process using a pre-labeled test set.
● https://arxiv.org/abs/1704.06189
90. Making Deep Neural Networks Robust to Label Noise: A Loss
Correction Approach
● Labels are expensive to obtain because
they require human labeling.
● They want to avoid the need for a set of
clean labels, or knowledge of the noise
statistics.
● During training, correct the loss function by reweighting it according to the estimated noise between classes (see the sketch below).
● https://arxiv.org/abs/1609.03683
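A minimal PyTorch sketch of forward loss correction under an estimated noise transition matrix T, with T[i, j] ≈ P(noisy label j | true label i); this follows the general idea of the paper, but the implementation details here are assumptions:

```python
import torch
import torch.nn.functional as F

def forward_corrected_loss(logits, noisy_labels, T):
    """Pass the predicted class probabilities through the estimated noise
    transition matrix and apply cross-entropy against the noisy labels."""
    p_clean = F.softmax(logits, dim=1)
    p_noisy = p_clean @ T                          # predicted noisy-label distribution
    return F.nll_loss(torch.log(p_noisy + 1e-8), noisy_labels)
```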
91. Harvesting Multiple Views for Marker-Less 3D Human Pose
Annotations
● Use a pretrained pose network to estimate a probability map for each part.
● Do this for multiple views.
● Fuse information into single pose
estimation.
● Use this pose as new ground truth for
training.
● Automatic annotations help to improve
accuracy.
93. Ubernet: Training a Universal Convolutional Neural Network
● Computer vision involves a host of tasks,
such as boundary detection, semantic
segmentation, surface estimation, object
detection, image classification.
● In a joint application, running a separate network for each task is infeasible.
● Can one network solve all of our computer
vision tasks?
○ Of course. Naively combine multiple
networks and get a single network.
○ Can we do better?
● https://arxiv.org/abs/1609.02132
94. Ubernet: Training a Universal Convolutional Neural Network
● How do we train multiple tasks without having a single dataset that covers all of them?
95. Ubernet: Training a Universal Convolutional Neural Network
● Architecture
○ Based on VGG16.
○ A minimal number of additional, task-specific layers.
○ Skip layers to combine the best features for every task.
○ Skip-layer connections are normalized using batch norm.
○ Multi-resolution CNN
○ Atrous convolution
● Training loss
○ Adapt loss per sample.
■ Zero loss when ground truth is missing.
○ Asynchronous SGD
■ Accumulate gradients for each task.
■ Only update weights after seeing enough samples for a specific task (see the sketch below).
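A minimal PyTorch sketch of the per-sample loss adaptation: tasks whose ground truth is missing for a sample contribute zero loss (and therefore zero gradient). Function and variable names are illustrative:

```python
import torch

def masked_multitask_loss(losses, has_label):
    """losses: list of per-sample loss tensors, one of shape (batch,) per task.
    has_label: matching list of 0/1 masks marking whether that task's
    ground truth exists for each sample; missing annotations are zeroed out."""
    total = 0.0
    for task_loss, mask in zip(losses, has_label):
        denom = mask.sum().clamp(min=1.0)          # avoid dividing by zero
        total = total + (task_loss * mask).sum() / denom
    return total
```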
96. Ubernet: Training a Universal Convolutional Neural Network
● Low memory back-propagation
98. Pascal In Detail - Make Pascal Great Again!
● https://sites.google.com/view/pasd
● Measure the progress in image
understanding as reflected in a diverse set
of visual tasks.
● Single-Task Challenges
○ Image Classification, Object Detection,
Semantic Segmentation, Instance
Segmentation, Object Part Segmentation,
Objectness, Boundary Detection, Occlusion
Recognition, Human Keypoint Estimation,
Human Action Recognition,
● Multi-Task Challenges
○ Boxes to Points Triathlon: Object Detection,
Instance Segmentation, Keypoint
Estimation
○ PASCAL++ Triathlon: Image Classification,
Object Detection, Semantic Segmentation
○ Humans in Detail Triathlon: Human Parts,
Keypoints, Action
● PASCAL Decathlon:
○ All 10 tasks
99. Taster - Visual Domain Decathlon
● http://www.robots.ox.ac.uk/~vgg/decathlon/
● Solve ten image classification problems simultaneously.
a. Aircraft
b. CIFAR-100
c. Daimler pedestrian
d. Describable textures
e. German traffic signs
f. ImageNet
g. VGG-Flowers
h. Omniglot
i. SVHN
j. UCF101 Dynamic Images
100. Learning multiple visual domains with residual adapters
● Primary goal is to develop neural network architectures that can work well in a
multiple-domain setting.
● Learn adapters that can be replaced for specific tasks.
● https://arxiv.org/abs/1705.08045
101. Incremental Learning Through Deep Adaptation (Amir Rosenfeld)
● It is often desirable to be able to add new capabilities without hindering
performance of already learned tasks.
● Fully preserves performance on the original task, with only a small increase
(around 20%) in the number of required parameters.
○ Other methods typically double the number of parameters.
● The learned architecture can be controlled to switch between various learned
representations, enabling a single network to solve a task from multiple
different domains.
● https://arxiv.org/abs/1705.04228
● Slides: https://sites.google.com/view/amirrosenfeld
● Challenge winner!
102. Incremental Learning Through Deep Adaptation (Amir Rosenfeld)
Method                     | Old Task Perf. | New Task Perf. | No. Params | Knowledge Reuse?
Train From Scratch         | Same           | Good           | High       | No
Fine-Tune last layer       | Same           | Suboptimal     | Low        | Yes
Fine-Tune all layers       | Decrease       | Best           | High       | Yes
Deep Adaptation (proposed) | Same           | Best           | Low        | Yes
103. Incremental Learning Through Deep Adaptation (Amir Rosenfeld)
● Basic idea:
○ Train network N1 on task T1.
○ For task Ti, train Ni by learning how to reuse the filters of N1
○ Reuse == make new filters by linear combinations of learned ones + bias.
○ Reparametrize network dynamically based on task.
● Can switch between multiple tasks using a vector of switching variables (α's).
(Diagram: original filters are linearly combined into modified filters; a switching variable selects the task-specific combination applied to the input.)
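A minimal numpy sketch of the re-parametrization idea: task-specific filters are built as linear combinations of the frozen base filters plus a bias, so only the combination weights are new parameters (shapes and names are assumptions for illustration):

```python
import numpy as np

def adapted_filters(base_filters, alpha, bias):
    """base_filters: (F, k, k, C) filters learned on the original task.
    alpha:           (F_new, F) per-task combination weights (the only
                     new parameters, together with the bias)."""
    new = np.einsum('nf,fklc->nklc', alpha, base_filters)
    return new + bias.reshape(-1, 1, 1, 1)
```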
104. Conclusions
● What did we talk about today?
○ Object detection and segmentation: people, faces and pose estimation.
○ New and exciting network architectures
○ Efficient deep learning: optimization and cascades
○ Multiscale information
○ Data augmentation, generation and synthesis
○ One network to rule them all
105. Conclusions
Networks keep getting more complex
106. Conclusions
State of the art keeps improving
107. Conclusions
But still need to be efficient!