This is an intensive meetup at Samsung Next IL covering the most interesting papers presented at CVPR 2017 last month. It is a good opportunity to get an overview of recent advances in deep learning applied to computer vision.
The following topics are covered:
• Object detection
• Pose estimation
• Efficient networks
2. About the speaker
Assaf Mushinsky
● Co-founder & Chief Scientist
● Breakthrough MSc research with Prof. Lior Wolf
● Computer Vision & Deep Learning expert
● Key technical roles at Samsung & Eyesight
3. Brodmann17
● Founded in 2016
● Raised $2M
● Team: 10 people, 6 of them researchers (PhDs, MScs)
● Core Technology: Deep Learning for Edge Devices
5. CVPR 2017
● Huge conference!
○ 4950 registrations
○ 783 accepted papers (out of 2620 valid submissions)
○ 215 orals
○ Also…
■ Tutorials!
■ Workshops!
● Papers available online.
○ Many of the papers were already published months ago.
● Videos available on the YouTube channel:
○ https://www.youtube.com/channel/UC0n76gicaarsN_Y9YShWwhw
8. Agenda
● What are we going to talk about today?
○ Object detection and segmentation: people, faces and pose estimation.
○ New and exciting network architectures
○ Efficient deep learning: optimization and cascades
○ Multiscale information
○ Data augmentation, generation and synthesis
○ A “Swiss army knife” network.
● What are we not going to talk about today?
○ Faster R-CNN: https://arxiv.org/abs/1506.01497
○ ResNet: https://arxiv.org/abs/1512.03385
10. Speed/accuracy trade-offs for modern convolutional object
detectors (Google)
● Presents a fair comparison of the leading object detection methods in terms of speed and accuracy.
● A single codebase and training framework for a fair comparison.
● Same hardware: Nvidia GeForce GTX Titan X GPU card.
● Multiple hyper-parameter configurations.
● Multiple network architectures.
● Paper: https://arxiv.org/abs/1611.10012
● Code: https://github.com/tensorflow/models/tree/master/object_detection
13. YOLO9000: Better, Faster, Stronger - Joseph Redmon
● Fast and accurate object detection.
● They improve on their first version of YOLO
by making it better, faster and stronger.
○ Better: YOLOv1 was fast but not very accurate; this version is more accurate.
○ Faster: New network for faster run time.
○ Stronger: Learn to detect 9000 object
classes.
● Paper: https://arxiv.org/abs/1612.08242
● Code: http://pjreddie.com/yolo9000/
● Won best paper honorable mention award.
14. YOLO9000: Better, Faster, Stronger - Joseph Redmon
● Batch normalization (+2%)
● High resolution classifier (+4%)
○ Train ImageNet classifier at 224x224
○ Fine-tune ImageNet classifier at 448x448
○ Train detector at 448x448
● Convolutional with anchor boxes
○ YOLOv1 used FC for prediction.
○ Remove FC and predict anchors (-0.3%)
■ But… recall increases 81%→88%
● Select anchors using k-means
● Direct location prediction (+5%)
○ Sigmoid for constrained bounding-box prediction, instead of the unconstrained prediction used in the RPN (see the sketch below).
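As a rough illustration of the constrained prediction, here is a minimal numpy sketch of the YOLOv2-style box decoding, where the sigmoid keeps the predicted center inside its grid cell (function and variable names are illustrative, not the authors' code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode one predicted box (YOLOv2-style constrained prediction).

    (cx, cy) is the top-left corner of the grid cell, (pw, ph) the
    anchor (prior) size; the sigmoid keeps the center inside the cell.
    """
    bx = cx + sigmoid(tx)   # center x, bounded to the cell
    by = cy + sigmoid(ty)   # center y, bounded to the cell
    bw = pw * np.exp(tw)    # width scales the anchor prior
    bh = ph * np.exp(th)    # height scales the anchor prior
    return bx, by, bw, bh
```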
15. YOLO9000: Better, Faster, Stronger - Joseph Redmon
● Fine-grained features (+1%)
○ How to combine the final features with higher-resolution features from an earlier layer?
○ Passthrough layer
■ Take the previous 26x26x512 feature map and stack adjacent spatial features into channels, getting 13x13x2048 (see the sketch after this slide).
■ Concatenate with the original features.
● Multi-scale training (+1%)
○ Every 10 batches randomly choose a new
image dimension size.
○ Forces the network to learn to predict well
across a variety of input dimensions.
● High resolution detector (+2%)
○ Use 544x544 instead of 416x416
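A minimal numpy sketch of the passthrough idea (a space-to-depth rearrangement), assuming channel-last feature maps; names are illustrative:

```python
import numpy as np

def space_to_depth(x, block=2):
    """Stack 2x2 spatial neighbours into channels (passthrough layer).

    x: feature map of shape (H, W, C); returns (H/2, W/2, C*4),
    e.g. 26x26x512 -> 13x13x2048, ready to concatenate with the
    13x13 features of the final layer.
    """
    h, w, c = x.shape
    x = x.reshape(h // block, block, w // block, block, c)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(h // block, w // block, block * block * c)

coarse = np.zeros((13, 13, 1024), dtype=np.float32)
fine = np.zeros((26, 26, 512), dtype=np.float32)
combined = np.concatenate([coarse, space_to_depth(fine)], axis=-1)  # 13x13x3072
```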
19. YOLO9000: Better, Faster, Stronger - Joseph Redmon
● We want detection to be accurate but we
also want it to be fast.
● Due to multiscale training, detectors can be
applied at different scales for
speed/accuracy trade off.
● Use Darknet-19 instead of VGG16.
○ Mostly 3x3 convolutions.
○ Like NIN: Use 1x1 filters to compress the
feature representation between 3x3 convs.
20. YOLO9000: Better, Faster, Stronger - Joseph Redmon
● How to learn detection for 9000 classes?
● During training mix images from both detection and
classification datasets.
○ For detection images, use full backprop.
○ For classification images, backpropagate only the classification part of the loss.
● Hierarchical classification
○ ImageNet labels are pulled from WordNet.
○ Simplify the problem by building a hierarchical tree
from the concepts in ImageNet.
○ Perform classification using conditional probabilities.
● This formulation also works for detection
○ Instead of assuming that every anchor contains an object, they use an objectness predictor.
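A small sketch of the hierarchical (conditional-probability) classification: the absolute probability of a concept is the product of the conditional probabilities along its path to the root. Data structures and names here are assumptions for illustration:

```python
def absolute_probability(node, cond_prob, parent):
    """Probability of a WordNet-style concept as the product of
    conditional probabilities along the path to the root.

    cond_prob[n] = P(n | parent of n), parent[n] = parent node or None.
    The root's conditional probability can be taken as 1
    (or as the objectness score when used for detection).
    """
    p = 1.0
    while node is not None:
        p *= cond_prob[node]
        node = parent[node]
    return p

# e.g. P("Norfolk terrier") =
#   P(Norfolk terrier | terrier) * P(terrier | dog) * P(dog | ...) * ...
```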
24. RON: Reverse Connection with Objectness Prior Networks for
Object Detection
● Same idea as “Feature Pyramid Networks for Object Detection”
● https://arxiv.org/abs/1707.01691
25. Accurate Single Stage Detector Using Recurrent Rolling
Convolution
● Same idea as “Feature Pyramid Networks for Object Detection”
● https://arxiv.org/abs/1704.05776
26. Object Detection Circa 2007
Source: Ross Girshick’s object detection tutorial in CVPR 2017 http://deeplearning.csail.mit.edu/instance_ross.pptx
27. Object Detection Today
Source: Ross Girshick’s object detection tutorial in CVPR 2017 http://deeplearning.csail.mit.edu/instance_ross.pptx
28. Mask R-CNN - Kaiming He, Ross Girshick (FAIR)
● Instance segmentation with pose estimation
for people.
● Extends faster R-CNN by adding new branch
for the instance mask task.
● Pose estimation can be added by simply
adding an additional branch.
● SOTA accuracy on detection, segmentation
and pose estimation at 5 FPS on GPU.
● https://arxiv.org/abs/1703.06870
● Girshick won young researcher award.
32. Mask R-CNN - Kaiming He, Ross Girshick (FAIR)
● RoiPool
○ Quantization breaks pixel-to-pixel alignment.
○ Too coarse for the fine spatial information required by the mask.
● RoiAlign
○ Bilinearly sample the proposal region and
avoid the quantization.
○ Smoothly normalizes features and predictions into a coordinate frame free of scale and aspect ratio.
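A minimal sketch of the bilinear sampling used by RoIAlign at one real-valued location (channel-last layout assumed, no boundary handling; RoIAlign averages a few such samples per RoI bin instead of snapping coordinates to the integer grid as RoIPool does):

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Sample a feature map feat of shape (H, W, C) at a real-valued (y, x)."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = y0 + 1, x0 + 1
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * feat[y0, x0] +
            (1 - wy) * wx       * feat[y0, x1] +
            wy       * (1 - wx) * feat[y1, x0] +
            wy       * wx       * feat[y1, x1])
```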
34. Mask R-CNN - Kaiming He, Ross Girshick (FAIR)
● Backbone architecture
○ ResNet
○ ResNeXt
○ FPN
● Mask representation
○ FC vs. Convolutional
○ Multinomial vs. Independent Masks:
softmax vs. sigmoid
○ Class-Specific vs. Class-Agnostic Masks:
almost same accuracy
● Multi-task learning
○ Mask task improves object detection
accuracy.
○ Keypoint task reduces object detection
accuracy.
35. Mask R-CNN - Kaiming He, Ross Girshick (FAIR)
● Pose estimation
○ Simply add an additional branch.
○ Model a keypoint’s location as a one-hot mask, and
adopt Mask R-CNN to predict K masks.
○ Experiments are mainly to demonstrate the
generality of the Mask R-CNN framework.
○ RoiAlign improves this task’s accuracy as well.
36. Learning non-maximum suppression
● Object detectors are mostly trained
end-to-end, except for the NMS.
○ NMS is still fully hand-crafted, and forces a
trade-off between recall and precision.
● Training loss is not evaluation loss.
○ Training is performed without NMS
○ During evaluation, multiple detections of the same object count as false positives.
● https://arxiv.org/abs/1705.02950
37. Learning non-maximum suppression
● Additional blocks that:
○ Encode pairwise information.
○ For each detection, pool information from all pairings.
○ Update feature vector.
○ Repeat.
● New loss:
○ Only one positive candidate per object.
○ Instead of the current practice of taking all detections with IoU > 50%.
39. Focal Loss for Dense Object Detection (FAIR)
● Two stage detectors are usually the most
accurate.
● Single stage detectors are simpler and
usually faster.
● Reshaping the cross-entropy loss to down-weight well-classified samples can improve the accuracy of single-stage detectors.
● This approach is shown to be better than online hard example mining.
● Architecture is based on FPN.
● https://arxiv.org/abs/1708.02002
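A minimal sketch of the focal loss for a single binary prediction, following the formula FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t) from the paper:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one example.

    p: predicted probability of the positive class, y: 0/1 label.
    gamma down-weights well-classified examples; alpha balances classes.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)
```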
40. Scale aware face detection
● Detection of small objects is
computationally expensive.
● But what if there are no small objects in an
image? Why should we waste computation
on scanning those scales?
● We can divide face detection into two tasks
○ Estimate the scale of faces in a given
image.
○ For each predicted scale, resize the image to a fixed size and apply detection.
● https://arxiv.org/abs/1706.09876
42. Realtime Multi-Person 2D Pose Estimation Using Part Affinity
Fields
● Multi-person pose estimation is difficult
○ Unknown number of people
○ Interactions between people make the association of parts difficult.
○ Runtime complexity tends to grow with the
number of people in the image.
● The proposed architecture is designed to
jointly learn part locations and their
association.
● Paper: https://arxiv.org/abs/1611.08050
● Code:
https://github.com/ZheC/Realtime_Multi-P
erson_Pose_Estimation
45. Realtime Multi-Person 2D Pose Estimation Using Part Affinity
Fields
● Two branches:
○ Part location confidence maps.
○ Part affinity fields.
● Multi-stage
○ Every stage gets the output of the previous stage as well as the input image features.
○ Output is refined over the different stages,
allowing resolution of conflicts.
● Multi-Person Parsing using PAFs
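A rough sketch of how a part affinity field can score the association between two candidate parts: sample the field along the segment connecting them and average its dot product with the unit direction (a simplification of the paper's line integral; names and sampling scheme are illustrative):

```python
import numpy as np

def paf_score(paf, p1, p2, num_samples=10):
    """Association score between two candidate parts.

    paf: (H, W, 2) part affinity field for this limb type;
    p1, p2: (x, y) candidate locations.
    """
    v = np.array(p2, dtype=np.float32) - np.array(p1, dtype=np.float32)
    u = v / (np.linalg.norm(v) + 1e-8)          # unit direction p1 -> p2
    score = 0.0
    for t in np.linspace(0.0, 1.0, num_samples):
        x, y = np.round(np.array(p1) + t * v).astype(int)
        score += np.dot(paf[y, x], u)           # agreement with the field
    return score / num_samples
```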
46. Towards Accurate Multi-person Pose Estimation in the Wild
(Google)
● Two stage cascade model:
○ Apply a Faster-RCNN person detector to
produce a bounding box around each
candidate person instance.
○ Apply a pose estimator to the image crop
extracted around each candidate person
instance in order to localize its keypoints
and re-score the corresponding proposal.
● https://arxiv.org/abs/1701.01779
● They have a newer version that works without an object detector and is very similar to the part affinity fields method.
● A demo of the newer version was presented at the conference.
48. Coarse-To-Fine Volumetric Prediction for Single-Image 3D
Human Pose
● Common approaches have drawbacks:
○ Estimating 3D pose by regression of (x,y,z)
○ 2D pose map and 3D refinement
● Solution:
○ 3D pose map estimation.
● https://arxiv.org/abs/1611.07828
49. ArtTrack: Articulated Multi-Person Tracking in the Wild
● How to use temporal information for
multi-person pose tracking?
○ Build a spatio-temporal graph with edges connecting different parts within the same frame and the same part across frames.
● Paper: https://arxiv.org/abs/1612.01465
● Dataset: http://www.posetrack.net
● Code:
https://github.com/eldar/pose-tensorflow
50. Let’s take a break
When we get back:
Award winning architectures
Efficient neural networks
A single network that does everything
52. Densely Connected Convolutional Networks
● Residual connections in ResNet allowed
networks to be substantially deeper, more
accurate, and efficient to train.
● Dense connections take this idea further by connecting every pair of layers within a block via channel-wise concatenation.
● Paper: https://arxiv.org/abs/1608.06993
● Code: https://github.com/liuzhuang13/DenseNet
● Memory efficient implementation:
https://arxiv.org/abs/1707.06990
● Won best paper award
53. Densely Connected Convolutional Networks
● Residual connections
● Dense connections
● Transition layers
○ The dense connectivity can't be applied when the feature-map size changes.
○ This is why convolution and pooling layers are
added between dense blocks.
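A minimal PyTorch sketch of a dense block, assuming the standard BN-ReLU-Conv ordering; this illustrates the connectivity pattern, not the reference implementation:

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Every layer sees the concatenation of all preceding feature maps
    and adds k (growth rate) new channels."""
    def __init__(self, in_channels, growth_rate=12, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            channels = in_channels + i * growth_rate
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, growth_rate, kernel_size=3, padding=1),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # dense connectivity
            features.append(out)
        return torch.cat(features, dim=1)

# A transition layer (1x1 conv + 2x2 average pooling) would follow the block
# to reduce channels and spatial size before the next dense block.
```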
57. Densely Connected Convolutional Networks
● Growth rate
○ Every layer produces k output feature maps.
○ The input to the l-th layer is thus k0 + k×(l−1) feature maps, where k0 is the number of channels at the block's input.
○ To prevent the network from growing too wide and
to improve the parameter efficiency k has to be
limited to a small integer.
○ Experiments show k=12 is sufficient to obtain
state-of-the-art results.
● Bottleneck layers
○ Even with a small growth rate, the number of
inputs for some layers can get very large.
○ 1×1 convolution is used to reduce the number of
features to 4k.
● Compression
○ To further improve model compactness, the transition layer transforms its m input feature maps into m/2 (compression factor θ = 0.5).
59. Densely Connected Convolutional Networks
● Stronger gradient flow.
● Parameter and computational efficiency.
● Diversified features due to concatenation of all
previous features.
● Maintains both high & low complexity features.
● Less prone to overfitting than ResNet when large amounts of data aren't available; works better even without augmentation.
60. Multi-Scale Dense Convolutional Networks
for Efficient Prediction
● Multi-scale networks with multiple classifiers.
● Multiple classifiers allow for cascaded
computation.
● Paper: https://arxiv.org/abs/1703.09844
61. Dual Path Networks
● ResNet enables feature re-usage.
● DenseNet enables exploration of new features.
● SOTA accuracy on ImageNet.
● https://arxiv.org/abs/1707.01629
64. Deep Roots: Improving CNN Efficiency with Hierarchical Filter
Groups
● Filter group:
○ In normal convolutional layers, all the filters process all input features.
○ Instead, break the filters and input features into
groups.
● Hierarchical filter groups:
○ Start with a large number of groups and reduce them as the model gets deeper.
● Didn’t compare to non-hierarchical filter groups.
● Reduces:
○ Model size
○ Running time
○ Memory consumption
● Can even improve accuracy
● https://arxiv.org/abs/1605.06489
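A small PyTorch illustration of the difference between a standard convolution and a grouped convolution (layer sizes are arbitrary):

```python
import torch.nn as nn

# Standard conv: each of the 256 filters sees all 256 input channels.
full = nn.Conv2d(256, 256, kernel_size=3, padding=1)

# Grouped conv with 8 groups: each group of 32 filters only sees its own
# 32 input channels, cutting parameters and multiply-adds by roughly 8x.
grouped = nn.Conv2d(256, 256, kernel_size=3, padding=1, groups=8)
```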
66. Xception: Deep Learning With Depthwise Separable
Convolutions
● Replaces Inception modules with depthwise separable convolutions (an extreme form of grouped convolution).
● Regular convolutions and depthwise separable convolutions lie at the two extremes of a discrete spectrum, with Inception modules being an intermediate point in between.
● Slightly outperforms Inception V3
on the ImageNet dataset
● Significantly outperforms
Inception V3 on a larger
classification dataset with 350
million images and 17,000 classes
● https://arxiv.org/abs/1610.02357
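A small PyTorch sketch of a depthwise separable convolution: the extreme case of grouping (one group per channel) followed by a 1x1 pointwise convolution that mixes channels (channel counts are arbitrary):

```python
import torch.nn as nn

depthwise = nn.Conv2d(256, 256, kernel_size=3, padding=1, groups=256)  # per-channel 3x3
pointwise = nn.Conv2d(256, 256, kernel_size=1)                         # mixes channels
separable = nn.Sequential(depthwise, pointwise)
```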
67. Aggregated Residual Transformations for Deep Neural
Networks (ResNeXt)
● ResNet + Inception = ResNeXt
● 2nd place ILSVRC 2016
● Paper: https://arxiv.org/abs/1611.05431
● Code: https://github.com/facebookresearch/ResNeXt
69. ShuffleNet: An Extremely Efficient Convolutional
Neural Network for Mobile Devices
● Minimizes the damage caused by filter groups by shuffling channels between groups.
● https://arxiv.org/abs/1707.01083
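A minimal PyTorch sketch of the channel shuffle operation (the standard reshape-transpose-reshape trick; not the authors' code):

```python
import torch

def channel_shuffle(x, groups):
    """Interleave channels across groups so that the next grouped
    convolution sees information from every group."""
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)
```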
70. Feedback Networks
● Iterative processing of the input
● Improves on the previous iteration using the previous features and the input.
● https://arxiv.org/abs/1612.09508
71. Dilated Residual Networks
● Classification networks gradually reduce the size of the activations until we are left with a single feature vector.
● Classification is usually a proxy task used to pretrain networks before they are transferred to other applications.
● We lose the spatial information that might
be beneficial to tasks such as localization
or segmentation.
● https://arxiv.org/abs/1705.09914
72. Dilated Residual Networks
● We can remove the pooling layers and
avoid the dimension reduction.
● But! Removing the pooling layers will
reduce the network’s receptive field and
hurt accuracy.
● How can we avoid spatial information loss
and still have a large receptive field?
73. Dilated Residual Networks
● Dilated convolutions:
○ Sparse filter: same receptive-field growth as a strided filter.
○ Doesn't skip any input data.
○ Doesn't change the spatial size of the data.
● Advantages:
○ Increases the receptive field.
○ Preserves spatial information.
○ Doesn't increase the number of network parameters.
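A one-line PyTorch illustration: a 3x3 convolution with dilation 2 covers a 5x5 neighbourhood with the same 9 weights and, with padding 2, keeps the spatial size unchanged:

```python
import torch.nn as nn

# Same 9 parameters as a regular 3x3 conv, but a 5x5 receptive field;
# padding=2 preserves the spatial resolution (no stride, no pooling).
dilated = nn.Conv2d(256, 256, kernel_size=3, dilation=2, padding=2)
```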
74. Dilated Residual Networks
● ResNet to Dilated Residual Network (DRN)
○ Remove stride, compensate with dilation
for groups 4 and 5.
○ No need to apply this to groups 1, 2 and 3, because stride 8 is known to preserve most of the information.
○ Original output size was 7×7, new output
size is 28×28.
○ Improves recognition of small objects.
75. Dilated Residual Networks
● DRN-B-26
○ Replaces early pooling with residual blocks.
○ Adds residual blocks with reduced dilation
at the end of the network.
● DRN-C-26
○ Removes residual connections from some
of the added blocks.
○ The layers added in DRN-B-26 didn't remove gridding artifacts, because their residual connections propagated the artifacts.
76. Dilated Residual Networks
● ImageNet Classification
○ DRN-A outperforms deeper ResNets with
same number of layers and parameters.
○ Each DRN-C significantly outperforms the
corresponding DRN-A, showing degridding
is beneficial.
77. Dilated Residual Networks
● ImageNet weakly-supervised localization
○ Lower is better.
○ DRN-C-26 outperforms DRN-A-50 despite its lower depth and lower classification accuracy.
○ DRN-C-26 also outperforms ResNet-101.
80. Not All Pixels Are Equal: Difficulty-aware Semantic
Segmentation via Deep Layer Cascade
● A deep layer cascade method that improves the accuracy and speed of semantic segmentation.
● The model is initially trained as a multi-loss model.
● A second training stage jointly fine-tunes the model as a cascade.
● Runs at ~15 FPS.
● https://arxiv.org/abs/1704.01344
81. Mimicking Very Efficient Network for Object Detection
● Train a small network to mimic the output of a larger one.
○ The large network acts as supervision for
training the smaller network.
○ The small network is trained using L2 loss to
mimic the output of the larger one.
○ Can be expanded to two-stage mimicking for
training efficient Faster R-CNN / R-FCN.
● Experiments
○ R-FCN w/ Inception on Caltech: 7.15
○ R-FCN w/ Inception/2 on Caltech: 8.88
○ R-FCN w/ Inception/2 mimic on Caltech: 7.31
● http://openaccess.thecvf.com/content_cvpr_
2017/papers/Li_Mimicking_Very_Efficient_CV
PR_2017_paper.pdf
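A minimal PyTorch sketch of the mimicking objective: an L2 loss between the small network's (adapted) feature map and the large network's. The 1x1 adaptation convolution is an assumption used here to match channel counts, not necessarily the authors' exact setup:

```python
import torch
import torch.nn.functional as F

def mimic_loss(student_feat, teacher_feat, adapt_conv):
    """L2 loss pushing the small network's features towards the large
    network's; the teacher is fixed (acts only as supervision)."""
    return F.mse_loss(adapt_conv(student_feat), teacher_feat.detach())
```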
82. Spatially Adaptive Computation Time for Residual Networks
● Automatically learns for which pixels to compute the residual functions and for which to simply keep the current value.
● Each layer outputs a halting confidence; once the accumulated confidence passes a threshold, computation stops for that pixel.
● Paper: https://arxiv.org/abs/1612.02297
● Code: https://github.com/mfigurnov/sact
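A very simplified numpy sketch of the halting idea as described on this slide (blocks and halting functions are hypothetical callables; the actual method is more involved):

```python
import numpy as np

def spatially_adaptive_residual(x, blocks, halting_fns, threshold=0.99):
    """x: (H, W, C) feature map. Each block also predicts a per-pixel
    halting score; once a pixel's accumulated score passes the threshold,
    later blocks no longer update that pixel."""
    cumulative = np.zeros(x.shape[:2], dtype=np.float32)
    for block, halting in zip(blocks, halting_fns):
        active = cumulative < threshold              # pixels still being computed
        residual = block(x)                          # (H, W, C) residual update
        x = np.where(active[..., None], x + residual, x)
        cumulative = cumulative + halting(x) * active
    return x
```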
83. LCNN: Lookup-based Convolutional Neural Network (XNOR.AI)
● Builds a dictionary for convolutions.
● Convolutions are computed as weighted combinations of dictionary entries.
● https://arxiv.org/abs/1611.06473
84. Binarized Neural Network with Separable Filters
● They build on Hubara et al.'s work on binarized NNs.
● Breaking 3x3 filters into 1x3 and 3x1 filters.
● 30% faster, minor drop in accuracy.
● https://arxiv.org/abs/1707.04693
86. Learning From Simulated and Unsupervised Images Through
Adversarial Training (Apple)
● Real train data is expensive. Can we use
simulated data?
○ Simulated data is cheap and we don’t need
to annotate it.
○ There is a gap between simulated and real images.
● How can we make synthetic images look
more real?
● How can we do that without changing the
properties of the synthetic images?
● They use this method for eye gaze
estimation and hand pose estimation.
● Paper: https://arxiv.org/abs/1612.07828
● Won best paper award.
87. Learning From Simulated and Unsupervised Images Through
Adversarial Training (Apple)
● They train GAN to modify the synthetic
image to look more real.
○ The generator modifies the image to fool
the discriminator.
○ The discriminator tries to classify real vs.
synthetic images.
● The refiner makes only small local changes, because it is a ResNet with a small receptive field.
● The discriminator's loss is local: it outputs a loss map instead of a single scalar loss.
● Humans achieved 80% accuracy distinguishing synthetic from real images, but only 51% on refined vs. real images.
88. A-Fast-RCNN: Hard Positive Generation via Adversary for
Object Detection
● Adversarial network that generates examples with occlusions and
deformations.
● https://arxiv.org/abs/1704.03414
89. Training Object Class Detectors With Click Supervision
● 9× faster labeling than full supervision.
● No comparison to the state of the art in terms of accuracy.
● Two-click validation helps determine the scale of the object.
● Starts with an annotator verification process using a pre-labeled test set.
● https://arxiv.org/abs/1704.06189
90. Making Deep Neural Networks Robust to Label Noise: A Loss
Correction Approach
● Labels are expensive to obtain because
they require human labeling.
● They want to avoid the need for a set of
clean labels, or knowledge of the noise
statistics.
● During training, correct the loss function by reweighting it according to the estimated noise between classes (see the sketch below).
● https://arxiv.org/abs/1609.03683
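A minimal PyTorch sketch of forward loss correction under an estimated noise transition matrix T, with T[i, j] ≈ P(noisy label j | true label i); this follows the general idea of the paper, but the implementation details here are assumptions:

```python
import torch
import torch.nn.functional as F

def forward_corrected_loss(logits, noisy_labels, T):
    """Pass the predicted class probabilities through the estimated noise
    transition matrix and apply cross-entropy against the noisy labels."""
    p_clean = F.softmax(logits, dim=1)
    p_noisy = p_clean @ T                          # predicted noisy-label distribution
    return F.nll_loss(torch.log(p_noisy + 1e-8), noisy_labels)
```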
91. Harvesting Multiple Views for Marker-Less 3D Human Pose
Annotations
● Use a pretrained pose network to estimate a probability map for each part.
● Do this for multiple views.
● Fuse information into single pose
estimation.
● Use this pose as new ground truth for
training.
● Automatic annotations help to improve
accuracy.
93. Ubernet: Training a Universal Convolutional Neural Network
● Computer vision involves a host of tasks,
such as boundary detection, semantic
segmentation, surface estimation, object
detection, image classification.
● In a joint application, running a separate network for each task is infeasible.
● Can one network solve all of our computer
vision tasks?
○ Of course. Naively combine multiple
networks and get a single network.
○ Can we do better?
● https://arxiv.org/abs/1609.02132
94. Ubernet: Training a Universal Convolutional Neural Network
● How do we train multiple tasks without having a single dataset that covers all of them?
95. Ubernet: Training a Universal Convolutional Neural Network
● Architecture
○ Based on VGG16.
○ A minimal number of additional, task-specific layers.
○ Skip layers to combine the best features for every task.
○ Skip-layer connections are normalized using batch norm.
○ Multi-resolution CNN
○ Atrous convolution
● Training loss
○ Adapt loss per sample.
■ Zero loss when ground truth is missing.
○ Asynchronous SGD
■ Accumulate gradients for each task.
■ Only update weights after seeing enough samples for a specific task (see the sketch below).
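A minimal PyTorch sketch of the per-sample loss adaptation: tasks whose ground truth is missing for a sample contribute zero loss (and therefore zero gradient). Function and variable names are illustrative:

```python
import torch

def masked_multitask_loss(losses, has_label):
    """losses: list of per-sample loss tensors, one of shape (batch,) per task.
    has_label: matching list of 0/1 masks marking whether that task's
    ground truth exists for each sample; missing annotations are zeroed out."""
    total = 0.0
    for task_loss, mask in zip(losses, has_label):
        denom = mask.sum().clamp(min=1.0)          # avoid dividing by zero
        total = total + (task_loss * mask).sum() / denom
    return total
```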
96. Ubernet: Training a Universal Convolutional Neural Network
● Low memory back-propagation
98. Pascal In Detail - Make Pascal Great Again!
● https://sites.google.com/view/pasd
● Measure the progress in image
understanding as reflected in a diverse set
of visual tasks.
● Single-Task Challenges
○ Image Classification, Object Detection,
Semantic Segmentation, Instance
Segmentation, Object Part Segmentation,
Objectness, Boundary Detection, Occlusion
Recognition, Human Keypoint Estimation,
Human Action Recognition,
● Multi-Task Challenges
○ Boxes to Points Triathlon: Object Detection,
Instance Segmentation, Keypoint
Estimation
○ PASCAL++ Triathlon: Image Classification,
Object Detection, Semantic Segmentation
○ Humans in Detail Triathlon: Human Parts,
Keypoints, Action
● PASCAL Decathlon:
○ All 10 tasks
99. Taster - Visual Domain Decathlon
● http://www.robots.ox.ac.uk/~vgg/decathlon/
● Solve ten image classification problems simultaneously.
a. Aircraft
b. CIFAR-100
c. Daimler pedestrian
d. Describable textures
e. German traffic signs
f. ImageNet
g. VGG-Flowers
h. Omniglot
i. SVHN
j. UCF101 Dynamic Images
100. Learning multiple visual domains with residual adapters
● Primary goal is to develop neural network architectures that can work well in a
multiple-domain setting.
● Learn adapters that can be replaced for specific tasks.
● https://arxiv.org/abs/1705.08045
101. Incremental Learning Through Deep Adaptation (Amir Rosenfeld)
● It is often desirable to be able to add new capabilities without hindering
performance of already learned tasks.
● Fully preserves performance on the original task, with only a small increase
(around 20%) in the number of required parameters.
○ Other methods typically double the number of parameters.
● The learned architecture can be controlled to switch between various learned
representations, enabling a single network to solve a task from multiple
different domains.
● https://arxiv.org/abs/1705.04228
● Slides: https://sites.google.com/view/amirrosenfeld
● Challenge winner!
102. Incremental Learning Through Deep Adaptation (Amir Rosenfeld)
Method                     | Old Task Perf. | New Task Perf. | No. Params | Knowledge Reuse?
Train From Scratch         | Same           | Good           | High       | No
Fine-Tune last layer       | Same           | Suboptimal     | Low        | Yes
Fine-Tune all layers       | Decrease       | Best           | High       | Yes
Deep Adaptation (proposed) | Same           | Best           | Low        | Yes
103. Incremental Learning Through Deep Adaptation (Amir Rosenfeld)
● Basic idea:
○ Train network N1 on task T1.
○ For task Ti, train Ni by learning how to reuse the filters of N1
○ Reuse == make new filters by linear combinations of learned ones + bias.
○ Reparametrize network dynamically based on task.
● Can switch between multiple tasks using a vector of switching variables (α's).
(Diagram: original filters are linearly combined into modified filters; a switching variable selects the task-specific combination applied to the input.)
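A minimal numpy sketch of the re-parametrization idea: task-specific filters are built as linear combinations of the frozen base filters plus a bias, so only the combination weights are new parameters (shapes and names are assumptions for illustration):

```python
import numpy as np

def adapted_filters(base_filters, alpha, bias):
    """base_filters: (F, k, k, C) filters learned on the original task.
    alpha:           (F_new, F) per-task combination weights (the only
                     new parameters, together with the bias)."""
    new = np.einsum('nf,fklc->nklc', alpha, base_filters)
    return new + bias.reshape(-1, 1, 1, 1)
```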
104. Conclusions
● What did we talk about today?
○ Object detection and segmentation: people, faces and pose estimation.
○ New and exciting network architectures
○ Efficient deep learning: optimization and cascades
○ Multiscale information
○ Data augmentation, generation and synthesis
○ One network to rule them all
105. Conclusions
Networks keep getting more complex
106. Conclusions
State of the art keeps improving
107. Conclusions
But still need to be efficient!