Using synthetic data for computer vision model training
WEBINAR, December 9, 2021

Alex Thaman, Senior Manager, Computer Vision
Kevin Saito, Senior Manager, AI Commercialization
Salehe Erfanian Ebadi, Senior ML Developer, AMLR
Agenda
→ Computer vision overview and advantages of synthetic data
→ Applying synthetic data to production systems
→ Synthetic data case studies
→ Unity’s research with synthetic data
→ Synthetic data generators
→ Q&A
Computer vision overview and advantages of synthetic data
High-volume, labeled data is critical to efficiently train a computer vision model.
Training computer vision models on real-world data has been the answer, but:
→ It’s time-consuming
→ It’s biased and inefficient
→ It’s expensive
→ It’s not always privacy-compliant
Typical computer vision workflow
Acquire real-world images → Label and annotate images → Train CV model → Evaluate CV model → Deploy CV model (then iterate)
* 70% of time is spent on data collection, labeling, and annotation
Solving challenges with... data collection and data labelling
● Insufficient data for the project due to non-availability of data
● Privacy and compliance hindering data collection
● Human labeling is costly, time-consuming, and error-prone
● Bias/errors in collected data, as the collected data represents only a subset of the population
Cost of labeling increases with complexity
→ Object detection
→ Semantic segmentation
→ Instance segmentation
→ Panoptic segmentation
[Figure: an input image and its corresponding labels for each task]
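To make the cost ordering concrete, compare what an annotator must produce per image for each task. The records below are a rough, COCO-style sketch; the field names are illustrative, not taken from this webinar or a specific dataset spec:

```python
# Rough, COCO-style sketch of per-object annotation payloads, illustrating
# why labeling cost grows with task complexity. Field names are illustrative.
object_detection = {
    "bbox": [120, 80, 60, 40],          # one box per object: x, y, w, h
    "category": "person",
}
instance_segmentation = {
    "segmentation": [[120, 80, 180, 80, 180, 120, 120, 120]],  # polygon per object
    "category": "person",
}
# Semantic segmentation labels every pixel with a class id; panoptic
# segmentation labels every pixel with both a class id and an instance id,
# so the whole image, including background "stuff", must be annotated.
semantic_segmentation = "H x W array of class ids"
panoptic_segmentation = "H x W array of (class id, instance id) pairs"
```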
Impracticality of real-world data in many situations
→ Situations with lots of assets to label, or where background labeling is required
→ When variational differences are subtle
→ When the situation occurs very infrequently or is impractical to capture
Domain randomization
Vary features of your dataset to make your model more robust (a minimal sketch follows this list):
→ Lighting
→ Background
→ Object orientation
→ Distractor objects
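As a minimal sketch of the idea (not Unity's Perception API; all parameter names and ranges are invented for illustration), a randomizer samples these features independently for every synthetic frame:

```python
import random

# Minimal domain-randomization sketch: sample a fresh scene configuration
# per frame. Parameter names and ranges are illustrative, not taken from
# Unity's Perception package.
def sample_scene_config(rng: random.Random) -> dict:
    return {
        "light_intensity": rng.uniform(0.2, 2.0),        # lighting
        "light_rotation_deg": rng.uniform(0.0, 360.0),
        "background_id": rng.randrange(500),             # background texture
        "object_yaw_deg": rng.uniform(0.0, 360.0),       # object orientation
        "object_pitch_deg": rng.uniform(-45.0, 45.0),
        "num_distractors": rng.randint(0, 10),           # distractor objects
    }

rng = random.Random(0)
for frame in range(3):
    print(frame, sample_scene_config(rng))
```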
Performance improvements with synthetic data
[Figure: detections under complex orientations, complex configurations, and complex lighting; green = accurate detections, red = missed detections; comparing a model trained on real data only vs. synthetic + real]
Training on synthetic + real data gave 20-30% more accurate detections under different conditions for this use case.
Applying synthetic data to production systems
Bringing AI to production

PROBLEM: “Quality of Service” - how can I be sure that my system works well, and continues to work well, in the real world?
- Pre-production: development/production data mismatch, edge cases, selection bias
- Post-production: model drift, survivorship bias

SOLUTION: Model generalization with synthetic data via domain randomization
- Synthetic data solution != real-world solution
- We want to leverage the programmability of synthetic data as a strength
Why does it work?
- Domain randomization
  - Perturbations to the environment do not have to be realistic, but merely show variation along dimensions that also vary in the real world (Intervention Design for Effective Sim2Real Transfer - https://arxiv.org/pdf/2012.02055.pdf)
  - Focuses on building “domain invariance”: if backgrounds should not matter for detecting objects, teach the model that the background does not matter.
- Well-known research on domain randomization
  - Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World (OpenAI) - https://arxiv.org/pdf/1703.06907.pdf
  - Structured Domain Randomization: Bridging the Reality Gap by Context-Aware Synthetic Data (nVidia) - https://arxiv.org/pdf/1810.10093.pdf
  - An Annotation Saved is an Annotation Earned: Using Fully Synthetic Training for Object Instance Detection (Google Cloud AI) - https://arxiv.org/pdf/1902.09967.pdf
- Large sets of highly varied synthetic data + small sets of real-world data produce the best results (a training sketch follows)
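That last point translates directly into training code. A minimal sketch, assuming PyTorch; the random tensors stand in for rendered synthetic frames and collected real images:

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Sketch: train on a large, highly varied synthetic set mixed with a small
# real set. TensorDatasets stand in for real image datasets here.
synthetic = TensorDataset(torch.randn(10_000, 3, 64, 64),
                          torch.randint(0, 2, (10_000,)))
real = TensorDataset(torch.randn(500, 3, 64, 64),
                     torch.randint(0, 2, (500,)))

loader = DataLoader(ConcatDataset([synthetic, real]), batch_size=32, shuffle=True)

for images, labels in loader:
    # forward pass / loss / optimizer step would go here
    break
```

In practice one might also oversample the small real set (e.g., with torch's WeightedRandomSampler) so it is not drowned out by the synthetic data.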
Case studies
Neural Pocket
unity.com/case-study/neural-pocket

Customer problem: As a smart city solutions provider, Neural Pocket needs scalable ways to train systems to recognize vehicles, people, and smartphones, and to identify potential security threats.

Resulting objective: Reduce the cycle time and overall costs of creating production-ready computer vision models.

Cost of the real-world data: Using real-world data, Neural Pocket typically had to do 30 training cycles, which cost $60K–150K and took 4-6 months per project.
Neural Pocket: object detection/recognition examples
Neural Pocket: Results
Object detection rate improvement using synthetic data

Training dataset              Object detection rate
Real knives                   27%
Real and synthetic knives     87%
Real guns                     80%
Real and synthetic guns       100%
Real bats                     80%
Real and synthetic bats       100%
Audere
resources.unity.com/ai-ml-content/audere-session

Customer problem: High labor costs to read COVID tests and report results, and the possibility of human error at scale.

Resulting objective: Build a mobile application that will read a result from a COVID test kit to improve reliability and reduce costs, with minimal human oversight.

Problem they ran into: COVID kits change frequently (monthly), and test result appearances vary widely even within a single kit. Kits are required to be stored in a biosafety lab with no windows until deployment to the real world, so no real training data with natural lighting or shadows was available.
Audere: two CV tasks
→ Locating the test kit parts (brand, diagnostic, etc.) -> OBJECT DETECTION
→ Reading test results as positive/negative -> IMAGE CLASSIFICATION
Audere: Approach
→ Create a digital copy of test kits with an artist
→ Place test kits into Unity with random backgrounds, lighting, blur, etc.
→ Use procedural materials for test kit strips to create high variation in test results

Audere: Results
→ Able to match the performance of the full real-world dataset using 4x less real-world data and ~8k synthetic images
→ Synthetic-trained models were more resilient to adverse conditions
Unity’s research with synthetic data
PeopleSansPeople
People + Sans (Middle English for “without”) + People
A data generator for a few human-centric computer vision tasks without needing real-world human data.
What does PeopleSansPeople provide?
● 28 parameterized simulation-ready 3D human assets
● 39 diverse animation clips
● 21,952 unique clothing textures (from 28 albedos, 28 masks, and 28 normals)
● Parameterized lighting
● Parameterized camera system
● Natural backgrounds
● Primitive occluders/distractors
● All packaged in macOS and Linux binaries (a hypothetical configuration sketch follows this list)
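To give a feel for what “parameterized” means here, the sketch below shows the kind of knobs such a generator exposes. It is purely illustrative: every field name is invented, and the actual configuration schema is documented with the released PeopleSansPeople binaries.

```python
# Hypothetical configuration for a PeopleSansPeople-style generator.
# Every field name here is invented for illustration; consult the released
# binaries' documentation for the real schema.
generator_config = {
    "humans": {"count_range": [0, 10], "animation_clips": 39},
    "clothing": {"albedos": 28, "masks": 28, "normals": 28},   # 28^3 = 21,952 combos
    "lighting": {"intensity_range": [0.1, 1.5], "color_jitter": True},
    "camera": {"fov_range_deg": [40, 90], "position_jitter_m": 2.0},
    "background": {"natural_images": True},
    "occluders": {"primitive_count_range": [0, 6]},
    "output": {"frames": 10_000, "random_seed": 42},
}
```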
Which CV tasks does PeopleSansPeople target?
● Human (2D and 3D bounding box) detection
● Human keypoint detection
● Human semantic/instance segmentation
PeopleSansPeople - Exposed Parameters, Objects
PeopleSansPeople - Exposed Parameters, Rendering
Dataset Statistics and Analysis
● COCO person dataset
  ○ 64,115 train images, 2,693 validation images
  ○ Divided into 100%, 50%, 10%, and 1% subsets (64,115; 32,057; 6,411; and 641 images)
● Synth dataset
  ○ 490k train images, 10k validation images, from 3 random seeds
  ○ The macOS and Linux binaries generate 10k images in ~3 minutes
  ○ Divided into 100%, 50%, 10%, and 1% subsets (490k, 245k, 49k, and 4.9k images)
          # train    # validation   # instances (train)   # instances w/ kpts (train)
COCO      64,115     2,693          262,465               149,813
Synth     490,000    10,000         >3,070,000            >2,900,000
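The percentage subsets above can be reproduced by sampling image IDs from a COCO-format annotation file; a rough sketch, where the file paths are examples only:

```python
import json
import random

# Sketch: write a p-fraction subset of a COCO-format annotation file,
# keeping only annotations whose images were sampled. Paths are examples.
def subset_coco(ann_path: str, out_path: str, fraction: float, seed: int = 0) -> None:
    with open(ann_path) as f:
        coco = json.load(f)
    rng = random.Random(seed)
    keep = rng.sample(coco["images"], int(len(coco["images"]) * fraction))
    keep_ids = {img["id"] for img in keep}
    coco["images"] = keep
    coco["annotations"] = [a for a in coco["annotations"] if a["image_id"] in keep_ids]
    with open(out_path, "w") as f:
        json.dump(coco, f)

# e.g., a 10% split like the one above:
# subset_coco("person_keypoints_train2017.json", "train_10pct.json", 0.10)
```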
[Figure: object-placement distributions for COCO, Synth, and JTA] Synthetic data from PeopleSansPeople covers a broad distribution of object placement within the image.
● COCO has fewer boxes per image than Synth.
● JTA contains mostly crowded scenes, hence many more boxes per image.
● JTA has a higher number of small boxes per image, and fewer large boxes per image.
● Synth has a relatively higher diversity of box sizes per image.
● In Synth, each keypoint is twice as likely to have an annotation as in COCO.
● JTA lacks COCO’s facial keypoints; we assume the head_center keypoint in JTA corresponds to the nose.
[Figure: keypoint and pose distributions for COCO, Synth, and JTA]
● Synth data from PeopleSansPeople has a higher diversity of poses.
● Our pose footprint also encompasses those of COCO and JTA.
Model Training
● Detectron2 Keypoint R-CNN R50-FPN model
● We train models from scratch on real and synthetic data
● We train models pre-trained on synthetic data and fine-tune them on real data (sketch below)
● In both cases above, we
  ○ use different subsets of the data (1%, 10%, 50%, and 100%)
  ○ perform evaluation on real data
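As a rough sketch of what the fine-tuning step might look like in Detectron2: the dataset names and the synthetic checkpoint path are placeholders (datasets must be registered beforehand), and the solver values are illustrative, not the paper's settings.

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

# Sketch: fine-tune Keypoint R-CNN R50-FPN on real data, initializing from
# a synthetic pre-training checkpoint. Dataset names and the weights path
# are placeholders, and solver settings are illustrative.
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml"))
cfg.DATASETS.TRAIN = ("my_real_train",)   # registered real dataset (placeholder)
cfg.DATASETS.TEST = ("my_real_val",)
cfg.MODEL.WEIGHTS = "checkpoints/synthetic_pretrained.pth"  # synthetic pre-training
cfg.SOLVER.BASE_LR = 0.02
cfg.SOLVER.MAX_ITER = 90_000

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```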
Results
[Figures: keypoint detection results on COCO test-dev2017 and COCO person-val2017]
● Adding more synthetic pre-training data boosts performance in few-shot and full-shot training, although zero-shot performance is poor due to the domain gap.
● Adding more real fine-tuning data unsurprisingly also increases performance.
Results
[Figure: comparison of gains obtained from synthetic pre-training vs. training from scratch and from ImageNet weights]
For domain-specific tasks, such as human-centric computer vision, domain-specific synthetic pre-training offers a much bigger advantage than ImageNet pre-training. The advantage is even more pronounced when fine-tuning data is scarce, as is often the case with human data due to ethical, legal, and privacy constraints.
Improved Model Performance - 6,411 COCO images
[Figures: qualitative detections with ImageNet pre-training vs. synthetic pre-training]
Synthetic data generators
Creating a Synthetic Data Generator
- Optimal synthetic data generation does not involve replicating real data collection strategies
  - Start with data diversity
  - Then focus on domain adaptation (as needed)
- Define your problem
  - What am I predicting?
  - What distributions do I know that I need?
  - Which variables do I have uncertainty about?
- Build a “Data Generator”: Assets + Sensor/Labeler + Randomizers → Data (see the sketch after this list)
  - These generators allow experimentation across ranges and distributions with multiple exposed “data hyperparameters”
- Scale in the cloud
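A minimal sketch of that decomposition; all class names are illustrative, not Unity Perception APIs. A real generator would render scenes and emit labels where this one just records the sampled parameters:

```python
import random
from dataclasses import dataclass, field

# Illustrative Assets + Sensor/Labeler + Randomizers decomposition. Each
# Randomizer's range is an exposed "data hyperparameter" to experiment with.
@dataclass
class Randomizer:
    name: str
    low: float
    high: float

    def sample(self, rng: random.Random) -> float:
        return rng.uniform(self.low, self.high)

@dataclass
class DataGenerator:
    assets: list
    randomizers: list = field(default_factory=list)
    seed: int = 0

    def generate(self, n_frames: int):
        rng = random.Random(self.seed)
        for i in range(n_frames):
            params = {r.name: r.sample(rng) for r in self.randomizers}
            # Rendering and labeling would happen here; we yield the record.
            yield {"frame": i, "assets": self.assets, "params": params}

gen = DataGenerator(
    assets=["crate", "pallet"],
    randomizers=[Randomizer("light_intensity", 0.2, 2.0),
                 Randomizer("camera_fov_deg", 40.0, 90.0)],
)
for record in gen.generate(2):
    print(record)
```

Sweeping the randomizer ranges, rather than hand-placing scenes, is what makes experimentation across distributions cheap.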
Digital assets
Asset sourcing
- You often need very specific objects for your use case (products, parts, etc.). There are multiple approaches to acquiring “digital twins”:
- Artist modeling
  - Contract artists to build assets or environments
  - Costs often run up to $100 per object
  - Building assets for computer vision use cases is relatively new, and requirements are not well understood
- Scanning
  - Create a 3D shape and scan all sides of the object
  - Works well for rectangular/boxy objects; more difficult for complex shapes
  - Typically needs artist cleanup/refinement
- Photogrammetry
  - Use a 3D scanner to create a digital twin
  - Many tools do not reliably handle reflections and transparency, and require artist cleanup/augmentation
- Procedural/parameterized models
  - Useful for cases where you need wide variance within a particular semantic category (a sketch follows this list)
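A sketch of the procedural/parameterized approach: one semantic category, many variants. The category and all parameter names are invented for illustration; a real pipeline would feed specs like these to a modeling tool or engine.

```python
import random

# Sketch of a parameterized asset: a single semantic category ("mug") whose
# dimensions and material vary procedurally, yielding many unique variants.
# The category and parameter names are invented for illustration.
def random_mug_spec(rng: random.Random) -> dict:
    return {
        "height_cm": rng.uniform(7.0, 13.0),
        "radius_cm": rng.uniform(3.0, 5.0),
        "has_handle": rng.random() < 0.9,
        "base_color_rgb": [rng.random() for _ in range(3)],
        "roughness": rng.uniform(0.1, 0.9),
    }

rng = random.Random(7)
variants = [random_mug_spec(rng) for _ in range(1000)]
print(variants[0])
```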
Asset sourcing – Unity Asset Store
Unity has a large collection of reusable 3D content and environments developed by our community of developers.
Sensors and labels
Randomization – PeopleSansPeople
Randomization
[Examples: Interior Home Generator, General Object Detection]
Common questions
- Any question involving the words “photorealism” or “ray tracing”
  - Importance depends on your starting point: existing data, target task, performance goals, training methodology. We have seen significant performance boosts without it.
- Isn’t data augmentation easier?
  - For some tasks it can be, but the sim2real gap still exists
  - Example: compositing makes it difficult to manage occlusion diversity and to keep scene lighting/shadows consistent
- Can we use GANs for domain adaptation?
  - An active research area; no clear winners that generalize well yet
Feedback for us, and a chance for you to win a $150 Amazon gift card
→ Please click on the link in the chat window (also shown below):
https://unitysoftware.co1.qualtrics.com/jfe/form/SV_dfXCjWzS5YOP2w6?&source=ondemand
→ We want to get a better sense of our audience and the things that might interest you in future webinar topics
Q&A

Alex Thaman, Senior Manager, Computer Vision
Kevin Saito, Senior Manager, AI Commercialization
Salehe Erfanian Ebadi, Senior ML Developer, AMLR
unity.com/products/computer-vision
