2. DEEP LEARNING → COMPUTER VISION
➢Success in applying deep learning to a number of computer vision problems:
➢Image classification
➢Image segmentation
➢Object detection, tracking, recognition
➢Image processing (style transfer, super-resolution, deblurring, etc.)
➢3D reconstruction
➢Image/video captioning/Q&A
➢…
3. CNN (CONVOLUTIONAL NEURAL NETWORKS)
CNNs have been successfully used in many image classification and object detection problems.
The most famous CNN architecture is LeNet, which was used to classify hand-written digits.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proc. of IEEE, 1998
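As a minimal illustration of the operation at the heart of a CNN such as LeNet, the following sketch implements a single-channel 2-D convolution (cross-correlation) in plain NumPy; the image and kernel values are illustrative, not taken from LeNet.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2-D cross-correlation: slide the kernel over the image
    and sum the element-wise products at each position."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = (image[i:i + kh, j:j + kw] * kernel).sum()
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
edge = np.array([[1.0, -1.0]])  # a simple horizontal-gradient filter
fm = conv2d(img, edge)          # feature map of shape (4, 3)
```

A real CNN stacks many such filters per layer and learns the kernel values from data rather than fixing them by hand.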
4. MODEL COMPLEXITY
The complexity of a learning model can be characterized by the number of parameters in the model.
A model with a large number of parameters can learn very complex structures.
Recent deep networks usually contain millions of parameters to be learned from the training data.
Figures adapted from Deep Learning, by I. Goodfellow, Y. Bengio and A. Courville, MIT Press
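To make the parameter count concrete, here is a small sketch that counts the weights and biases of a fully connected network; the layer sizes are assumptions for the example, not from the figures.

```python
def mlp_param_count(layer_sizes):
    """Count weights plus biases for each consecutive pair of layers
    in a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Even a modest 784-256-128-10 classifier has ~235k parameters.
n = mlp_param_count([784, 256, 128, 10])
```

Convolutional layers share weights across spatial positions, which is one reason CNNs need far fewer parameters than fully connected networks of the same input size.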
5. DATASET SIZE
The age of “Big Data” has made machine learning much easier: it provides very large amounts of training data, of wide variety, for deep networks (with millions of parameters) to learn from and reach human-level capability.
Figures adapted from Deep Learning, by I. Goodfellow, Y. Bengio and A. Courville, MIT Press
6. FACE RECOGNITION DATASET
Dataset        | Subjects  | Images       | Note
CASIA-WebFace  | 10,575    | 494,414      | The second largest public dataset available for face verification and recognition problems.
VGGFace2       | 9,131     | 3.31 million | Images have large variations in pose, age, illumination, ethnicity and profession.
MS-Celeb-1M    | 100k      | 10 million   | To facilitate the IRC face recognition task.
LFW            | 5,749     | 13,233       | For face verification.
FaceScrub      | 530       | 107,818      | One of the largest public face databases.
MegaFace       | 672,000+  | 4.7 million  | The largest (in number of identities) publicly available facial recognition dataset.
CelebA         | 10,177    | 202,599      | 5 landmark locations and 40 binary attribute annotations per image.
MultiPie       | 337       | 750,000+     | Under 15 view points and 19 illumination conditions.
UMDFaces       | 8,277     | 367,888      | Contains both still images and video frames.
7. DATA ANNOTATION
➢Image-level annotation
➢Object-level annotation
(Annotation examples for the classes { edge, center }, ordered by labeling cost: image-level labels (very fast) → point annotations (fast) → bounding boxes (slow) → per-pixel labels (very slow).)
8. PROBLEMS WITH PREPARING TRAINING DATA
➢Requires large amounts of training data of wide variety
➢Imbalanced data
➢Usually domain-specific
➢Intensive human labeling cost
➢Restricted to supervised learning
10. DATA AUGMENTATION
State-of-the-art neural networks typically have on the order of millions of parameters. The number of parameters needed grows with the complexity of the task the model has to perform, and the amount of training data should grow in proportion to the number of parameters in the model.
12. Data Augmentation for Training a Detector of Partially Occluded Cars
Augmentation strategy:
Resize the vehicles in the training images to 0.1–0.9 of their original size.
Horizontally flip the images with probability 0.5.
With probability 0.5, truncate a vehicle by a new window such that at least 25% of the vehicle remains visible.
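The strategy above can be sketched in NumPy as follows; the scale range and probabilities follow the slide, while the helper names and nearest-neighbor resizing are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)

def augment(vehicle):
    """Apply the slide's augmentation strategy to one vehicle crop."""
    # Resize to 0.1-0.9 of the original size (nearest-neighbor for brevity).
    s = rng.uniform(0.1, 0.9)
    h = max(1, int(vehicle.shape[0] * s))
    w = max(1, int(vehicle.shape[1] * s))
    ys = (np.arange(h) / s).astype(int)
    xs = (np.arange(w) / s).astype(int)
    out = vehicle[np.ix_(ys, xs)]
    # Horizontal flip with probability 0.5.
    if rng.random() < 0.5:
        out = out[:, ::-1]
    # Truncate with probability 0.5, keeping at least 25% of the width visible.
    if rng.random() < 0.5:
        keep = max(1, int(out.shape[1] * rng.uniform(0.25, 1.0)))
        out = out[:, :keep]
    return out

v = rng.integers(0, 255, (100, 100), dtype=np.uint8)
a = augment(v)
```

In a full detector-training pipeline the bounding-box labels would be transformed alongside the pixels.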
14. TRANSFER LEARNING
Transfer learning is a machine learning method in which a model trained for one task is reused as the starting point for training a model for another task.
It is popular for computer vision tasks that require vast compute and time resources to train large, deep neural network models.
This process tends to work when the features are general, i.e., suitable for both the base and target tasks, rather than specific to the base task.
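A toy sketch of the idea: a frozen “pretrained” feature extractor (here a fixed random projection standing in for a CNN backbone) is reused, and only a new linear head is trained for the target task. All names, data, and the ridge-regression head are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Pretrained" backbone: a frozen projection from 64-dim input to 16-dim features.
W_backbone = rng.standard_normal((64, 16))

def features(x):
    # Frozen ReLU features; in real transfer learning these come from a
    # network trained on the base task.
    return np.maximum(x @ W_backbone, 0.0)

# Target task: a small labeled set (where transfer learning helps most).
X = rng.standard_normal((40, 64))
y = (X[:, 0] > 0).astype(float)

# Train only the new head, via ridge regression in closed form.
F = features(X)
lam = 1e-2
W_head = np.linalg.solve(F.T @ F + lam * np.eye(F.shape[1]), F.T @ y)

preds = (features(X) @ W_head > 0.5).astype(float)
train_acc = (preds == y).mean()
```

In practice the backbone would be a network pretrained on a large dataset such as ImageNet, and the head would be fine-tuned with gradient descent, possibly unfreezing some backbone layers.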
18. INTRODUCTION
Supervised Learning – Train a model with labeled data only.
Semi-Supervised Learning – Train a model with labeled (R) and unlabeled (U) data, usually U ≫ R.
Why semi-supervised learning? Collecting data is easy, but collecting labeled data is expensive.
(Diagram: some labeled data + lots of unlabeled data → semi-supervised learning → model.)
19. PSEUDO LABELING
Example: cat vs. dog classification.
1/5: Start with a small labeled set (cat and dog images) and a large unlabeled set.
2/5: Train a model on the labeled data.
3/5: Use the trained model to predict labels for the unlabeled data (pseudo-labels).
4/5: Add the pseudo-labeled data to the training set and train again.
5/5: Repeat as needed.
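The pseudo-labeling loop can be sketched with a nearest-centroid classifier on toy 2-D data; the data, the two classes, and the classifier are illustrative stand-ins for the cat/dog CNN.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two well-separated Gaussian blobs: a few labeled points, many unlabeled.
X_lab = np.vstack([rng.normal(-2, 1, (5, 2)), rng.normal(2, 1, (5, 2))])
y_lab = np.array([0] * 5 + [1] * 5)
X_unl = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])

def fit_centroids(X, y):
    # "Training" = compute one centroid per class.
    return np.stack([X[y == c].mean(axis=0) for c in (0, 1)])

def predict(cent, X):
    # Assign each point to the nearest class centroid.
    d = np.linalg.norm(X[:, None, :] - cent[None, :, :], axis=2)
    return d.argmin(axis=1)

# Train on labeled data, pseudo-label the unlabeled pool, then retrain
# on labeled + pseudo-labeled data.
cent = fit_centroids(X_lab, y_lab)
pseudo = predict(cent, X_unl)
X_all = np.vstack([X_lab, X_unl])
y_all = np.concatenate([y_lab, pseudo])
cent2 = fit_centroids(X_all, y_all)
```

In practice one usually keeps only confidently predicted pseudo-labels (e.g., above a probability threshold) before retraining.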
20. DISCUSSION
Pros:
➢Collecting labeled data is expensive; unlabeled data is plentiful.
➢Unlabeled instances can yield a more accurate decision boundary than labeled instances alone.
Cons:
➢No way to verify the accuracy of the produced pseudo-labels → less trustworthy.
(Figure: decision boundary fit on labeled instances only vs. the boundary after adding unlabeled instances.)
21. GAN FOR IMAGE SYNTHESIS
(Generative Adversarial Network)
22. GAN ARCHITECTURE
A generator network G maps noise z to fake images G(z); a discriminator network D takes real images x and fake images G(z) and classifies each as real or fake, producing the loss for both networks.

Objective Function:

$$\min_{\theta_g} \max_{\theta_d} \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\log D_{\theta_d}(x)\right] + \mathbb{E}_{z \sim p_z}\!\left[\log\!\left(1 - D_{\theta_d}\!\left(G_{\theta_g}(z)\right)\right)\right]$$
Generative adversarial nets, I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, et al., NIPS 2014
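The objective can be evaluated for one batch as follows, assuming the discriminator outputs probabilities; the function names are illustrative, not from the paper.

```python
import numpy as np

def d_loss(d_real, d_fake):
    """Discriminator loss. D ascends E[log D(x)] + E[log(1 - D(G(z)))];
    written here as a loss to be minimized, hence the sign flip."""
    return -(np.log(d_real).mean() + np.log(1.0 - d_fake).mean())

def g_loss(d_fake):
    """Generator loss: G descends E[log(1 - D(G(z)))]. (The common
    non-saturating variant ascends E[log D(G(z))] instead.)"""
    return np.log(1.0 - d_fake).mean()

# An undecided discriminator (D = 0.5 everywhere) gives d_loss = 2 log 2.
ex = d_loss(np.full(4, 0.5), np.full(4, 0.5))
```

Note that the generator's loss falls as D(G(z)) rises, i.e., as the generator gets better at fooling the discriminator.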
23. CONDITIONAL GAN
Conditional GANs learn a mapping from an observed image x and a random noise vector z to an output image y: G : {x, z} → y.
Image-to-Image Translation with Conditional Adversarial Networks, P. Isola et al., CVPR 2017
24. GAN FOR DATA AUGMENTATION
Removing the need for dataset collection with GAN-based image-to-image translation: the detection bounding boxes of the source-domain images can be reused for the translated images.
25. PRESERVING IMAGE-OBJECTS FOR IMAGE TRANSLATION
(Comparison figure: CycleGAN vs. AugGAN.)
AugGAN: Cross Domain Adaptation with GAN-based Data Augmentation, S. Huang et al., ECCV 2018
26. ACCURACY COMPARISON FOR OBJECT DETECTION
Comparison of detection accuracies for YOLO and Faster R-CNN detectors trained with data transformed by different GANs trained from the SYNTHIA and GTA datasets.
KITTI-D2N-S/G: KITTI Day-to-Night training data transformed by GAN learnt from
SYNTHIA/GTA
SCNT: Self-Collected Nighttime Testing data.
AugGAN: Cross Domain Adaptation with GAN-based Data Augmentation, S. Huang et al., ECCV 2018
28. WHY TRAIN A DNN WITH SYNTHETIC DATA?
It is easy to produce data once a synthetic model/environment has been established.
Accurate labels are expensive, sometimes difficult to obtain, or labor-intensive.
A synthetic environment can be flexibly adjusted as needed.
Synthetic data is a substitute for data that contains sensitive information.
Synthetic data has been widely used in areas such as medical imaging, autonomous driving, and robotic control/navigation.
29. WAFER MAP DEFECT PATTERN CLASSIFICATION
Wafer Map Defect Pattern Classification and Image Retrieval Using Convolutional Neural Network, IEEE Transactions on
Semiconductor Manufacturing, 2018
• Generate synthetic wafer maps through a Poisson point process.
• Test on real wafer maps:
  Classification accuracy: almost 100%
  Image retrieval error rate: 3.7%
(Figure: examples of the generated wafer maps of different classes.)
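A rough sketch of the data-generation idea (not the paper's exact procedure): sample defect locations on a wafer map with a homogeneous Poisson point process. The map size and defect intensity are assumed values.

```python
import numpy as np

rng = np.random.default_rng(2)

def synth_wafer_map(size=64, intensity=0.01):
    """Return a binary size x size wafer map: draw the defect count from a
    Poisson distribution, then scatter the defects uniformly over the map."""
    n_defects = rng.poisson(intensity * size * size)
    ys = rng.integers(0, size, n_defects)
    xs = rng.integers(0, size, n_defects)
    wafer = np.zeros((size, size), dtype=np.uint8)
    wafer[ys, xs] = 1
    return wafer

wm = synth_wafer_map()
```

Different defect classes (e.g., ring, scratch, cluster patterns) would use spatially varying intensities rather than a uniform one.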
30. WRENCH DETECTION
Generate synthetic images in a virtual environment.
• Randomly adjust the position and texture of each wrench.
• Automatically generate an accurate segmentation mask.
(Pipeline: 3D wrench model → synthetic environment → image generated by Unity, with a segmentation mask and a complete-wrenches-only mask; real data shown for comparison.)
31. WRENCH DETECTION AND SEGMENTATION
Photorealism
• Randomly add Gaussian noise to the generated image.
• Transfer the style of the generated images into the style of the real data.
(Pipeline: images shot in Unity → add Gaussian noise → style transfer* toward the real data → final synthetic image.)
*A Closed-form Solution to Photorealistic Image Stylization, ECCV 2018
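The Gaussian-noise step can be sketched as follows; the noise standard deviation is an assumed value, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(3)

def add_gaussian_noise(img, sigma=5.0):
    """Add zero-mean Gaussian noise to an 8-bit image, clipping back
    to the valid [0, 255] range."""
    noisy = img.astype(np.float64) + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

img = np.full((8, 8, 3), 128, dtype=np.uint8)
noisy = add_gaussian_noise(img)
```

This perturbation narrows the gap between perfectly clean renders and real sensor images before the style-transfer step.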
33. WRENCH DETECTION
Average Precision @0.5 IoU when using different training data for the wrench detection task.
Real data: 392 images; synthetic data: 1,000 images.
Training Data                           | Mask AP | Bbox AP
Only Real                               | 57.1    | 72.0
Only Synthetic (w/o style transfer)     | 54.4    | 60.4
Only Synthetic (w/ style transfer)      | 59.0    | 64.7
Real and Synthetic (w/o style transfer) | 73.0    | 76.9
Real and Synthetic (w/ style transfer)  | 78.2    | 82.0
Using Mask R-CNN as the wrench detector.
Y.-H. Lee et al., “Automatic generation of photorealistic training data for detection of industrial components,” ICIP 2019
34. COMPUTER VISION SERVICES
Office: SharePoint + OneDrive; PPTX + Word accessibility detail
Cognitive Services: Video Indexer, Content Moderator
Custom Vision: image classification, object detection
OCR: leading OCR model; forms + structured documents; exposed via the Computer Vision API
Computer Vision: image tagging, image captioning, adult/logo detection, etc.
Face: face recognition, face detection, attributes
Partners / Investments: Dynamics (retail, mixed reality)
Form Recognizer: extract text, key-value pairs, and tables; customized to your forms, without manual labeling
37. TRAIN IN THE CLOUD, RUN ANYWHERE
Client Platform     | Format
iOS                 | CoreML
Android             | TensorFlow, TensorFlow Lite
Linux, Windows, ARM | Docker, Azure IoT Edge, Azure Functions, Azure ML
Windows             | ONNX
39. CUSTOM VISION SERVICE
• Current:
• Robust object detectors and image classifiers with fast training speed and advanced training options
• Train on cloud, evaluate on device:
https://customvision.ai -> Android, iOS, ONNX,
Docker or Cloud service
• Real Customer Scenarios:
• Visual alerts from IoT cameras (e.g., workplace safety, truck load detection, traffic)
• Product recognition for grocery store check-out
• Object counting
• Social media analysis (logo detection)
• Drone imagery analysis
(Examples: product counting, logo recognition, pedestrian and car detection.)
42. CONCLUSION
➢The performance of a model is heavily dependent on its training data.
➢Sample quantity, annotation quality, and representative variation of the training data are all critical to the success of an AI system.
➢Data augmentation is a simple way to increase the data size.
➢Data annotation/labeling effort can be dramatically reduced by semi-supervised learning.
➢Data synthesis has proved quite successful for training DNN models in many real-world applications.
➢Domain adaptation is very useful for making the distribution of the training data similar to that of the real application scenario.