[212]big models without big data using domain specific deep networks in data-scarce settings

Big models without big data:
Using deep networks for computer vision in
data-scarce settings
Jon Almazan, Cesar de Souza, Yohann Cabon,
Diane Larlus, Naila Murray, Jerome Revaud

Naver Labs Contributors
Yohann Cabon
Jerome Revaud
Cesar de Souza
Diane Larlus
Jon Almazan
Naila Murray

Deep learning for computer vision:
The data-scarcity challenge
Supervised deep learning:
☺ State-of-the-art for many CV tasks
☹ Requires lots of annotated data
Visual data is cheap and plentiful
Annotated data may be:
• Expensive
• Proprietary
• Non-feasible
How to use deep learning in data-scarce settings?
24 hrs of Photography by Erik Kessels

Dealing with data-scarcity
Data synthesis
Domain adaptation
Data cleaning

Domain Adaptation
Leveraging annotated data in one or more related source
domains, to learn a model for unseen data in a target domain

Context: Attention prediction
Task: predict topographical attention map
Existing approaches: model it as a classification or regression task
Our approach: model attention as a stochastic process, using
probability distribution prediction (PDP)
[Figure: ground-truth attention map vs. prediction by PDP]
Jetley, Murray, Vig. End-to-End Saliency Mapping via Probability Distribution Prediction. CVPR 2016.

Approach
Model attention map as a generalized Bernoulli distribution
Apply novel loss functions that penalize the distance between predicted (p) and target (t) distributions
Use a fully-convolutional architecture for probability distribution prediction
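A minimal sketch of the idea on this slide, assuming NumPy. A softmax over all pixel locations turns a score map into a generalized Bernoulli distribution p, and a distance between p and the target t serves as the loss. The Bhattacharyya and KL forms below are illustrative choices and the function names are hypothetical; they are not necessarily the exact losses used in the talk.

```python
import numpy as np

def spatial_softmax(scores):
    """Turn an HxW score map into a probability distribution over pixels."""
    flat = scores.reshape(-1)
    flat = flat - flat.max()          # subtract the max for numerical stability
    exp = np.exp(flat)
    return (exp / exp.sum()).reshape(scores.shape)

def bhattacharyya_loss(p, t, eps=1e-12):
    """Negative log of the Bhattacharyya coefficient between p and t."""
    coefficient = np.sum(np.sqrt(p * t))
    return -np.log(coefficient + eps)

def kl_loss(p, t, eps=1e-12):
    """KL divergence from the target t to the prediction p."""
    return np.sum(t * np.log((t + eps) / (p + eps)))

# Hypothetical 4x4 score maps standing in for network output and ground truth.
rng = np.random.default_rng(0)
predicted = spatial_softmax(rng.normal(size=(4, 4)))
target = spatial_softmax(rng.normal(size=(4, 4)))
print(bhattacharyya_loss(predicted, target), kl_loss(predicted, target))
```

Both losses are zero (up to eps) when the prediction equals the target and grow as the two distributions diverge.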

Data
Ground-truth attention data:
• Normally collected with eye-trackers
• Very expensive to collect
Jiang et al.*:
• introduce SALICON dataset
• use mouse-tracking as a proxy for eye-tracking
We train our models on SALICON and fine-tune/test on
eye-tracking data
*Jiang et al. SALICON: Saliency in Context. CVPR 2015.
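The pre-train-then-fine-tune protocol can be illustrated with a toy example, assuming NumPy. A linear model stands in for the deep network, a large synthetic "proxy" set stands in for mouse-tracking data, and a small "target" set stands in for eye-tracking data. All data, names, and hyperparameters here are hypothetical.

```python
import numpy as np

def train(weights, inputs, targets, lr, steps):
    """Plain gradient descent on mean squared error for a linear model."""
    w = weights.copy()
    for _ in range(steps):
        grad = 2.0 * inputs.T @ (inputs @ w - targets) / len(inputs)
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
true_target_w = np.array([1.0, -2.0, 0.5])
# The proxy domain is related to the target domain but not identical.
true_proxy_w = true_target_w + np.array([0.1, -0.1, 0.1])

proxy_x = rng.normal(size=(5000, 3))      # plentiful proxy data
proxy_y = proxy_x @ true_proxy_w
target_x = rng.normal(size=(10, 3))       # scarce target data
target_y = target_x @ true_target_w

# Step 1: pre-train from scratch on the proxy data.
pretrained = train(np.zeros(3), proxy_x, proxy_y, lr=0.1, steps=200)
# Step 2: fine-tune on the target data, starting from the pre-trained weights
# and using a smaller learning rate.
finetuned = train(pretrained, target_x, target_y, lr=0.01, steps=50)

print(np.linalg.norm(pretrained - true_target_w),
      np.linalg.norm(finetuned - true_target_w))
```

In this toy setting the pre-trained weights already sit close to the target solution, so a few fine-tuning steps on ten samples are enough to move closer to it.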
University of Kent

Results
Results in source domain: mouse-tracking prediction
[Figure panels: convergence of AUC using different loss functions; performance on SALICON test set]

Results
Results in target domain:
task-free eye-tracking prediction (OSIE dataset)
Results in target domain:
task-dependent eye-tracking prediction (VOCA 2012 dataset)

Conclusion
Problem: attention map prediction using limited target data
Solution: training with appropriate loss functions, and pre-training with proxy data

Data synthesis
Domain adaptation
Data cleaning

Context: Instance-level Retrieval
Principle: Given a query image, find similar images in a (large)
database

Recent approaches
Recent methods leverage deep learning:
☺ Representations are compact and fast at test time!
Use standard networks designed for image classification:
☹ Not designed for retrieval
☹ Results significantly below the state-of-the-art

Can we learn to represent images for
retrieval?
Yes, if:
1. Training data is available
2. The network architecture can capture fine details
3. Training focuses on retrieval
Gordo, Almazan, Revaud, Larlus. Deep Image Retrieval: Learning global representations for image search. ECCV 2016.
Gordo, Almazan, Revaud, Larlus. End-to-End Learning of Deep Visual Representations for Image Retrieval. IJCV 2017.

Obtaining Training Data
Public dataset of landmark images [Babenko et al., Neural codes, ECCV 2014]
• ~200K images
• 600 different landmarks (Eiffel Tower, Colosseum in Rome, Big Ben…)
• Extremely noisy. Learning fails without clean data.
[Figure: example images showing a prototypical view, a non-prototypical view, and a wrong category]

We proposed an automatic cleaning technique:
• Create graph per class using image matching
• Prune edges corresponding to low matching scores
• Use verified keypoint matches to mine bounding boxes
Gordo, Almazan, Revaud, Larlus. Deep Image Retrieval: Learning global representations for image search. ECCV 2016.
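The first two cleaning steps can be sketched in plain Python. The sketch assumes pairwise matching scores are already available as a dictionary, and it assumes that only the largest connected component of the pruned graph is kept as the clean set for a class. The function name, the score values, and the threshold are hypothetical; bounding-box mining is not covered.

```python
from collections import defaultdict, deque

def clean_class(image_ids, match_scores, min_score):
    """Prune weak edges, then return the largest connected component.

    match_scores maps a pair (id_a, id_b) to an image matching score.
    """
    adjacency = defaultdict(set)
    for (a, b), score in match_scores.items():
        if score >= min_score:            # drop edges with low matching scores
            adjacency[a].add(b)
            adjacency[b].add(a)

    seen = set()
    largest = []
    for start in image_ids:
        if start in seen:
            continue
        # Breadth-first search collects one connected component.
        component = []
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        if len(component) > len(largest):
            largest = component
    return sorted(largest)

# Hypothetical scores for five images tagged with the same landmark.
ids = ["a", "b", "c", "d", "e"]
scores = {("a", "b"): 40, ("b", "c"): 35, ("c", "d"): 3, ("d", "e"): 50}
kept = clean_class(ids, scores, min_score=20)
print(kept)
```

Images left outside the kept component are treated as noise for that class, which is how wrong-category and weakly matching images get filtered out in this sketch.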

We proposed an automatic cleaning technique, resulting in:
• 40K spatially verified images
• Approximate bounding box annotations
• A new cleaned dataset, now publicly available

Proposed approach
Learning to rank images:
We propose a new three-stream Siamese Network: a network designed for
retrieval
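A minimal sketch of a triplet ranking loss of the kind a three-stream Siamese network is typically trained with, assuming NumPy. The three streams share weights and produce descriptors for a query, a relevant image, and a non-relevant image. The margin value, the function names, and the toy descriptors are assumptions of this sketch, not values from the talk.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale a descriptor to unit length."""
    return x / (np.linalg.norm(x) + eps)

def triplet_ranking_loss(query, positive, negative, margin=0.1):
    """Penalize triplets where the negative is not farther than the positive by the margin."""
    dist_pos = np.sum((query - positive) ** 2)
    dist_neg = np.sum((query - negative) ** 2)
    return 0.5 * max(0.0, margin + dist_pos - dist_neg)

# Toy 2-D descriptors; real descriptors would come from the shared network.
q = l2_normalize(np.array([1.0, 0.0]))
relevant = l2_normalize(np.array([1.0, 0.0]))
irrelevant = l2_normalize(np.array([0.0, 1.0]))

well_ranked = triplet_ranking_loss(q, relevant, irrelevant)
badly_ranked = triplet_ranking_loss(q, irrelevant, relevant)
print(well_ranked, badly_ranked)
```

The loss is zero when the relevant image is already closer to the query than the non-relevant one by at least the margin, so training effort concentrates on triplets that are ranked incorrectly.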

Experimental evaluation on standard benchmarks
Oxford dataset
• 5k images
• 5k images + 100k distractor images
Paris dataset
• 6k images
INRIA Holidays dataset
• 1491 images
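Results on these benchmarks are reported as mean average precision (mAP). The sketch below shows the textbook definition in plain Python, assuming each query comes with a ranked list of relevance flags that contains every relevant image. The official benchmark protocols differ in details (for example, how ambiguous images are handled), so this is an illustration rather than the evaluation code used for the talk.

```python
def average_precision(ranked_relevance):
    """Average of precision@k over the ranks k where a relevant image appears."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(per_query_relevance):
    """Mean of the per-query average precision values."""
    values = [average_precision(r) for r in per_query_relevance]
    return sum(values) / len(values)

# Two hypothetical queries with their ranked results (True = relevant).
query_a = [True, False, True, False]
query_b = [False, True]
print(mean_average_precision([query_a, query_b]))
```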

Experiments: Oxford 5k and Oxford 105k
[Bar charts: mean average precision (mAP) of deep methods, traditional methods, and ours.
Oxford 5k, plotted values: 55.7, 53.1, 71.6, 72.2, 77.3, 85, 82.7, 84.3, 84.9, 86.9, 89.4, 94.7.
Oxford 105k, plotted values: 52.3, 50.1, 67.8, 73.2, 81.8, 76.7, 80.2, 79.5, 85.3, 84, 93.6.]

Experiments: Paris 6k and INRIA Holidays
[Bar charts comparing deep methods, traditional methods, and ours.
Paris 6k, plotted values: 79.7, 85.5, 86.5, 86.5, 80.5, 83.4, 82.4, 85.1, 82.8, 96.7.
INRIA Holidays, plotted values: 78.9, 82, 87.5, 84.9, 82.5, 84.7, 75.8, 81.3, 94.8.]

Qualitative results

Conclusion
Problem: efficient instance-level image retrieval using deep networks
Solution: training with reliable annotations and an appropriate model architecture

Data synthesis
Domain adaptation
Data cleaning

Synthetic Data for Computer Vision
Benefits
• Complete control
• Automatic annotations
• Quantity & variability
Challenges
• Chicken & egg problem?
• Technically feasible and cost-effective?
Our solution
• Off-the-shelf game engine (Unity)
• Seeding virtual worlds with limited real-world sensor data
• Automatic generation of all labels via shader programming

Synthetic Data for Computer Vision
Gaidon et al. Virtual Worlds as Proxy for Multi-Object Tracking Analysis. CVPR 2016.
Ros et al. The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes. CVPR 2016.
Richter et al. Playing for Data: Ground Truth from Computer Games. ECCV 2016.

Virtual worlds for action classification
From modelling vehicles to modelling human actions:
Orders of magnitude increase in complexity:
• non-rigid motion
• complex interactions with objects and people
• large diversity in viewpoints and appearance
How to create diverse, realistic, and physically-plausible training videos?
Our solution: Procedural Human Action Videos (PHAV):
• generative model of human action videos
de Souza, Cabon, Gaidon, Lopez. Procedural Generation of Videos to Train Deep Action Recognition Networks. CVPR 2017.


Procedural Human Action Videos
PHAV Data modalities:
• RGB
• Depth
• Semantic Segmentation
• Instance Segmentation
• Horizontal Flow
• Vertical Flow
Extracted using Multiple Render Targets

Adding PHAV helps training, particularly when real-world data is limited.

Conclusion
Problem: generate large-scale annotated synthetic videos useful for CV
Solution: modern game engine, real-to-virtual cloning, shaders

Data synthesis
Domain adaptation
Data cleaning

Some numbers
Time to train the network: ~1 week on a single M40 GPU
Time to encode images: ~10 images per second on an M40 GPU
Total size per encoded image: 8 KB (128 images per MB; dim=2048)
Time to compare images: millions of comparisons per second
• After PQ compression: 256 bytes/image with minor decrease in accuracy
Training memory requirements: ~3 x 7 GB
• 3-stream residual networks do not naively fit in memory!
• Each stream is processed sequentially: only one stream active at a time
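The storage figures on this slide can be checked with a few lines of arithmetic. The sketch assumes each of the 2048 dimensions is stored as a 4-byte float and that 1 MB means 1024 x 1024 bytes; both are assumptions, chosen because they reproduce the 8 KB and 128-images-per-MB figures.

```python
DIM = 2048
BYTES_PER_FLOAT = 4                      # assumption: 32-bit floats
BYTES_PER_MB = 1024 * 1024               # assumption: binary megabyte

raw_bytes = DIM * BYTES_PER_FLOAT        # size of one uncompressed descriptor
images_per_mb = BYTES_PER_MB // raw_bytes

PQ_BYTES = 256                           # size after PQ compression, from the slide
compression_ratio = raw_bytes // PQ_BYTES
pq_images_per_mb = BYTES_PER_MB // PQ_BYTES

print(raw_bytes, images_per_mb, compression_ratio, pq_images_per_mb)
```

Under these assumptions, PQ compression shrinks each descriptor by a factor of 32, so about 4096 compressed descriptors fit in one megabyte.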
