Using synthetic data for computer vision model training
WEBINAR, December 9, 2021

Alex Thaman, Senior Manager, Computer Vision
Kevin Saito, Senior Manager, AI Commercialization
Salehe Erfanian Ebadi, Senior ML Developer, AMLR
Agenda
→ Computer vision overview and advantages of synthetic data
→ Applying synthetic data to production systems
→ Synthetic data case studies
→ Unity’s research with synthetic data
→ Synthetic data generators
→ Q&A
Computer vision overview and advantages of synthetic data
High-volume, labeled data is critical to efficiently train a computer vision model.
Training computer vision models on real-world data has been the answer, but:
→ It’s time-consuming
→ It’s biased and inefficient
→ It’s expensive
→ It’s not always privacy-compliant
Typical computer vision workflow
Acquire real-world images → Label and annotate images → Train CV model → Evaluate CV model → Deploy CV model (then iterate)
* 70% of time is spent on data collection, labeling, and annotation
Solving challenges with... data collection and data labelling
● Insufficient data for the project due to non-availability of data
● Privacy and compliance hindering data collection
● Human labeling is costly, time-consuming, and error-prone
● Bias/errors in collected data, as the collected data represents only a subset of the population
Cost of labeling increases with complexity
→ Object detection
→ Semantic segmentation
→ Instance segmentation
→ Panoptic segmentation
[Figure: an input image and its corresponding labels for each task]
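To make the cost ordering concrete, compare what an annotator must produce per image for each task. The records below are a rough, COCO-style sketch; the field names are illustrative, not taken from this webinar or a specific dataset spec:

```python
# Rough, COCO-style sketch of per-object annotation payloads, illustrating
# why labeling cost grows with task complexity. Field names are illustrative.
object_detection = {
    "bbox": [120, 80, 60, 40],          # one box per object: x, y, w, h
    "category": "person",
}
instance_segmentation = {
    "segmentation": [[120, 80, 180, 80, 180, 120, 120, 120]],  # polygon per object
    "category": "person",
}
# Semantic segmentation labels every pixel with a class id; panoptic
# segmentation labels every pixel with both a class id and an instance id,
# so the whole image, including background "stuff", must be annotated.
semantic_segmentation = "H x W array of class ids"
panoptic_segmentation = "H x W array of (class id, instance id) pairs"
```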
Impracticality of real-world data in many situations
→ Situations with lots of assets to label, or where background labeling is required
→ When variational differences are subtle
→ When the situation occurs very infrequently or is impractical to capture
Domain randomization
Vary features of your dataset to make your model more robust (a minimal sketch follows this list):
→ Lighting
→ Background
→ Object orientation
→ Distractor objects
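As a minimal sketch of the idea (not Unity's Perception API; all parameter names and ranges are invented for illustration), a randomizer samples these features independently for every synthetic frame:

```python
import random

# Minimal domain-randomization sketch: sample a fresh scene configuration
# per frame. Parameter names and ranges are illustrative, not taken from
# Unity's Perception package.
def sample_scene_config(rng: random.Random) -> dict:
    return {
        "light_intensity": rng.uniform(0.2, 2.0),        # lighting
        "light_rotation_deg": rng.uniform(0.0, 360.0),
        "background_id": rng.randrange(500),             # background texture
        "object_yaw_deg": rng.uniform(0.0, 360.0),       # object orientation
        "object_pitch_deg": rng.uniform(-45.0, 45.0),
        "num_distractors": rng.randint(0, 10),           # distractor objects
    }

rng = random.Random(0)
for frame in range(3):
    print(frame, sample_scene_config(rng))
```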
Performance improvements with synthetic data
[Figure: detections under complex orientations, complex configurations, and complex lighting; green = accurate detections, red = missed detections; comparing a model trained on real data only vs. synthetic + real]
Training on synthetic + real data gave 20-30% more accurate detections under different conditions for this use case.
Applying synthetic data to production systems
Bringing AI to production

PROBLEM: “Quality of Service” - how can I be sure that my system works well, and continues to work well, in the real world?
- Pre-production: development/production data mismatch, edge cases, selection bias
- Post-production: model drift, survivorship bias

SOLUTION: Model generalization with synthetic data via domain randomization
- Synthetic data solution != real-world solution
- We want to leverage the programmability of synthetic data as a strength
Why does it work?
- Domain randomization
  - Perturbations to the environment do not have to be realistic, but merely show variation along dimensions that also vary in the real world (Intervention Design for Effective Sim2Real Transfer - https://arxiv.org/pdf/2012.02055.pdf)
  - Focuses on building “domain invariance”: if backgrounds should not matter for detecting objects, teach the model that the background does not matter.
- Well-known research on domain randomization
  - Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World (OpenAI) - https://arxiv.org/pdf/1703.06907.pdf
  - Structured Domain Randomization: Bridging the Reality Gap by Context-Aware Synthetic Data (nVidia) - https://arxiv.org/pdf/1810.10093.pdf
  - An Annotation Saved is an Annotation Earned: Using Fully Synthetic Training for Object Instance Detection (Google Cloud AI) - https://arxiv.org/pdf/1902.09967.pdf
- Large sets of highly varied synthetic data + small sets of real-world data produce the best results (a training sketch follows)
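That last point translates directly into training code. A minimal sketch, assuming PyTorch; the random tensors stand in for rendered synthetic frames and collected real images:

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Sketch: train on a large, highly varied synthetic set mixed with a small
# real set. TensorDatasets stand in for real image datasets here.
synthetic = TensorDataset(torch.randn(10_000, 3, 64, 64),
                          torch.randint(0, 2, (10_000,)))
real = TensorDataset(torch.randn(500, 3, 64, 64),
                     torch.randint(0, 2, (500,)))

loader = DataLoader(ConcatDataset([synthetic, real]), batch_size=32, shuffle=True)

for images, labels in loader:
    # forward pass / loss / optimizer step would go here
    break
```

In practice one might also oversample the small real set (e.g., with torch's WeightedRandomSampler) so it is not drowned out by the synthetic data.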
Case studies
Neural Pocket
unity.com/case-study/neural-pocket

Customer problem: As a smart city solutions provider, Neural Pocket needs scalable ways to train systems to recognize vehicles, people, and smartphones, and to identify potential security threats.

Resulting objective: Reduce the cycle time and overall costs of creating production-ready computer vision models.

Cost of the real-world data: Using real-world data, Neural Pocket typically had to do 30 training cycles, which cost $60K–150K and took 4-6 months per project.
Neural Pocket: object detection/recognition examples
Neural Pocket: Results
Object detection rate improvement using synthetic data

Training dataset              Object detection rate
Real knives                   27%
Real and synthetic knives     87%
Real guns                     80%
Real and synthetic guns       100%
Real bats                     80%
Real and synthetic bats       100%
Audere
resources.unity.com/ai-ml-content/audere-session

Customer problem: High labor costs to read COVID tests and report results, and the possibility of human error at scale.

Resulting objective: Build a mobile application that will read a result from a COVID test kit to improve reliability and reduce costs, with minimal human oversight.

Problem they ran into: COVID kits change frequently (monthly), and test result appearances vary widely even within a single kit. Kits are required to be stored in a biosafety lab with no windows until deployment to the real world, so no real training data with natural lighting or shadows was available.
Audere: two CV tasks
→ Locating the test kit parts (brand, diagnostic, etc.) -> OBJECT DETECTION
→ Reading test results as positive/negative -> IMAGE CLASSIFICATION
Audere: Approach
→ Create a digital copy of test kits with an artist
→ Place test kits into Unity with random backgrounds, lighting, blur, etc.
→ Use procedural materials for test kit strips to create high variation in test results

Audere: Results
→ Able to match the performance of the full real-world dataset using 4x less real-world data and ~8k synthetic images
→ Synthetic-trained models were more resilient to adverse conditions
Unity’s research with synthetic data
PeopleSansPeople
People + Sans (Middle English for “without”) + People
A data generator for a few human-centric computer vision tasks without needing real-world human data.
What does PeopleSansPeople provide?
● 28 parameterized simulation-ready 3D human assets
● 39 diverse animation clips
● 21,952 unique clothing textures (from 28 albedos, 28 masks, and 28 normals)
● Parameterized lighting
● Parameterized camera system
● Natural backgrounds
● Primitive occluders/distractors
● All packaged in macOS and Linux binaries (a hypothetical configuration sketch follows this list)
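To give a feel for what “parameterized” means here, the sketch below shows the kind of knobs such a generator exposes. It is purely illustrative: every field name is invented, and the actual configuration schema is documented with the released PeopleSansPeople binaries.

```python
# Hypothetical configuration for a PeopleSansPeople-style generator.
# Every field name here is invented for illustration; consult the released
# binaries' documentation for the real schema.
generator_config = {
    "humans": {"count_range": [0, 10], "animation_clips": 39},
    "clothing": {"albedos": 28, "masks": 28, "normals": 28},   # 28^3 = 21,952 combos
    "lighting": {"intensity_range": [0.1, 1.5], "color_jitter": True},
    "camera": {"fov_range_deg": [40, 90], "position_jitter_m": 2.0},
    "background": {"natural_images": True},
    "occluders": {"primitive_count_range": [0, 6]},
    "output": {"frames": 10_000, "random_seed": 42},
}
```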
Which CV tasks does PeopleSansPeople target?
● Human (2D and 3D bounding box) detection
● Human keypoint detection
● Human semantic/instance segmentation
PeopleSansPeople - Exposed Parameters, Objects
PeopleSansPeople - Exposed Parameters, Rendering
Dataset Statistics and Analysis
● COCO person dataset
  ○ 64,115 train images, 2,693 validation images
  ○ Divided into 100%, 50%, 10%, and 1% subsets (64,115; 32,057; 6,411; and 641 images)
● Synth dataset
  ○ 490k train images, 10k validation images, from 3 random seeds
  ○ The macOS and Linux binaries generate 10k images in ~3 minutes
  ○ Divided into 100%, 50%, 10%, and 1% subsets (490k, 245k, 49k, and 4.9k images)
          # train    # validation   # instances (train)   # instances w/ kpts (train)
COCO      64,115     2,693          262,465               149,813
Synth     490,000    10,000         >3,070,000            >2,900,000
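The percentage subsets above can be reproduced by sampling image IDs from a COCO-format annotation file; a rough sketch, where the file paths are examples only:

```python
import json
import random

# Sketch: write a p-fraction subset of a COCO-format annotation file,
# keeping only annotations whose images were sampled. Paths are examples.
def subset_coco(ann_path: str, out_path: str, fraction: float, seed: int = 0) -> None:
    with open(ann_path) as f:
        coco = json.load(f)
    rng = random.Random(seed)
    keep = rng.sample(coco["images"], int(len(coco["images"]) * fraction))
    keep_ids = {img["id"] for img in keep}
    coco["images"] = keep
    coco["annotations"] = [a for a in coco["annotations"] if a["image_id"] in keep_ids]
    with open(out_path, "w") as f:
        json.dump(coco, f)

# e.g., a 10% split like the one above:
# subset_coco("person_keypoints_train2017.json", "train_10pct.json", 0.10)
```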
[Figure: object-placement distributions for COCO, Synth, and JTA] Synthetic data from PeopleSansPeople covers a broad distribution of object placement within the image.
● COCO has fewer boxes per image than Synth.
● JTA contains mostly crowded scenes, hence many more boxes per image.
● JTA has a higher number of small boxes per image, and fewer large boxes per image.
● Synth has a relatively higher diversity of box sizes per image.
● In Synth, each keypoint is twice as likely to have an annotation as in COCO.
● JTA lacks COCO’s facial keypoints; we assume the head_center keypoint in JTA corresponds to the nose.
[Figure: keypoint and pose distributions for COCO, Synth, and JTA]
● Synth data from PeopleSansPeople has a higher diversity of poses.
● Our pose footprint also encompasses those of COCO and JTA.
Model Training
● Detectron2 Keypoint R-CNN R50-FPN model
● We train models from scratch on real and synthetic data
● We train models pre-trained on synthetic data and fine-tune them on real data (sketch below)
● In both cases above, we
  ○ use different subsets of the data (1%, 10%, 50%, and 100%)
  ○ perform evaluation on real data
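As a rough sketch of what the fine-tuning step might look like in Detectron2: the dataset names and the synthetic checkpoint path are placeholders (datasets must be registered beforehand), and the solver values are illustrative, not the paper's settings.

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

# Sketch: fine-tune Keypoint R-CNN R50-FPN on real data, initializing from
# a synthetic pre-training checkpoint. Dataset names and the weights path
# are placeholders, and solver settings are illustrative.
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml"))
cfg.DATASETS.TRAIN = ("my_real_train",)   # registered real dataset (placeholder)
cfg.DATASETS.TEST = ("my_real_val",)
cfg.MODEL.WEIGHTS = "checkpoints/synthetic_pretrained.pth"  # synthetic pre-training
cfg.SOLVER.BASE_LR = 0.02
cfg.SOLVER.MAX_ITER = 90_000

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```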
Results
[Figures: keypoint detection results on COCO test-dev2017 and COCO person-val2017]
● Adding more synthetic pre-training data boosts performance in few-shot and full-shot training, although zero-shot performance is poor due to the domain gap.
● Adding more real fine-tuning data unsurprisingly also increases performance.
Results
[Figure: comparison of gains obtained from synthetic pre-training vs. training from scratch and from ImageNet weights]
For domain-specific tasks, such as human-centric computer vision, domain-specific synthetic pre-training offers a much bigger advantage than ImageNet pre-training. The advantage is even more pronounced when fine-tuning data is scarce, as is often the case with human data due to ethical, legal, and privacy constraints.
Improved Model Performance - 6,411 COCO images
[Figures: qualitative detections with ImageNet pre-training vs. synthetic pre-training]
Synthetic data generators
Creating a Synthetic Data Generator
- Optimal synthetic data generation does not involve replicating real data collection strategies
  - Start with data diversity
  - Then focus on domain adaptation (as needed)
- Define your problem
  - What am I predicting?
  - What distributions do I know that I need?
  - Which variables do I have uncertainty about?
- Build a “Data Generator”: Assets + Sensor/Labeler + Randomizers → Data (see the sketch after this list)
  - These generators allow experimentation across ranges and distributions with multiple exposed “data hyperparameters”
- Scale in the cloud
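A minimal sketch of that decomposition; all class names are illustrative, not Unity Perception APIs. A real generator would render scenes and emit labels where this one just records the sampled parameters:

```python
import random
from dataclasses import dataclass, field

# Illustrative Assets + Sensor/Labeler + Randomizers decomposition. Each
# Randomizer's range is an exposed "data hyperparameter" to experiment with.
@dataclass
class Randomizer:
    name: str
    low: float
    high: float

    def sample(self, rng: random.Random) -> float:
        return rng.uniform(self.low, self.high)

@dataclass
class DataGenerator:
    assets: list
    randomizers: list = field(default_factory=list)
    seed: int = 0

    def generate(self, n_frames: int):
        rng = random.Random(self.seed)
        for i in range(n_frames):
            params = {r.name: r.sample(rng) for r in self.randomizers}
            # Rendering and labeling would happen here; we yield the record.
            yield {"frame": i, "assets": self.assets, "params": params}

gen = DataGenerator(
    assets=["crate", "pallet"],
    randomizers=[Randomizer("light_intensity", 0.2, 2.0),
                 Randomizer("camera_fov_deg", 40.0, 90.0)],
)
for record in gen.generate(2):
    print(record)
```

Sweeping the randomizer ranges, rather than hand-placing scenes, is what makes experimentation across distributions cheap.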
Digital assets
Asset sourcing
- You often need very specific objects for your use case (products, parts, etc.). There are multiple approaches to acquiring “digital twins”:
- Artist modeling
  - Contract artists to build assets or environments
  - Costs often run up to $100 per object
  - Building assets for computer vision use cases is relatively new, and requirements are not well understood
- Scanning
  - Create a 3D shape and scan all sides of the object
  - Works well for rectangular/boxy objects; more difficult for complex shapes
  - Typically needs artist cleanup/refinement
- Photogrammetry
  - Use a 3D scanner to create a digital twin
  - Many tools do not reliably handle reflections and transparency, and require artist cleanup/augmentation
- Procedural/parameterized models
  - Useful for cases where you need wide variance within a particular semantic category (a sketch follows this list)
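A sketch of the procedural/parameterized approach: one semantic category, many variants. The category and all parameter names are invented for illustration; a real pipeline would feed specs like these to a modeling tool or engine.

```python
import random

# Sketch of a parameterized asset: a single semantic category ("mug") whose
# dimensions and material vary procedurally, yielding many unique variants.
# The category and parameter names are invented for illustration.
def random_mug_spec(rng: random.Random) -> dict:
    return {
        "height_cm": rng.uniform(7.0, 13.0),
        "radius_cm": rng.uniform(3.0, 5.0),
        "has_handle": rng.random() < 0.9,
        "base_color_rgb": [rng.random() for _ in range(3)],
        "roughness": rng.uniform(0.1, 0.9),
    }

rng = random.Random(7)
variants = [random_mug_spec(rng) for _ in range(1000)]
print(variants[0])
```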
Asset sourcing – Unity Asset Store
Unity has a large collection of reusable 3D content and environments developed by our community of developers.
Sensors and labels
Randomization – PeopleSansPeople
Randomization
[Examples: Interior Home Generator, General Object Detection]
Common questions
- Any question involving the words “photorealism” or “ray tracing”
  - Importance depends on your starting point: existing data, target task, performance goals, training methodology. We have seen significant performance boosts without it.
- Isn’t data augmentation easier?
  - For some tasks it can be, but the sim2real gap still exists
  - Example: compositing makes it difficult to manage occlusion diversity and to keep scene lighting/shadows consistent
- Can we use GANs for domain adaptation?
  - An active research area; no clear winners that generalize well yet
Feedback for us, and a chance for you to win a $150 Amazon gift card
→ Please click on the link in the chat window (also shown below):
https://unitysoftware.co1.qualtrics.com/jfe/form/SV_dfXCjWzS5YOP2w6?&source=ondemand
→ We want to get a better sense of our audience and the things that might interest you in future webinar topics
Q&A

Alex Thaman, Senior Manager, Computer Vision
Kevin Saito, Senior Manager, AI Commercialization
Salehe Erfanian Ebadi, Senior ML Developer, AMLR
unity.com/products/computer-vision
